Bug 197029
Summary: | intel_iommu=on breaks resume from suspend on several Thinkpad models | ||
---|---|---|---|
Product: | Drivers | Reporter: | Ronan (ronan.jouchet) |
Component: | IOMMU | Assignee: | Zhang Rui (rui.zhang) |
Status: | NEEDINFO --- | ||
Severity: | normal | CC: | aacid, abc, andersk, baolu.lu, bordjukov, charles, diego.viola, encrypto.soldier, f.v.claus, felash, frassl, hi, jarkko.sakkinen, kernel.org, m.bineder, mail, Michaelnussbaum08, nicolas.gruel, rafaeln.dev, richard.berger, ronan.jouchet, rui.zhang, uzytkownik2, vasyl.demin |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 5.8.7 | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: |
Email sent to linux-integrity@vger.kernel.org on 2020-09-06, including a digest of what we know and updated debug info as of kernel 5.8.7
ThinkPad X1 Carbon 4th 5.16.1, intel_iommu=on , enable_gvt=1, and suspend working
patch to remove memory allocation in iommu_suspend()
patch to remove memory allocation in iommu_suspend() |
Description
Ronan
2017-09-23 15:24:39 UTC
One more piece of information: over at the Arch bbs, someone suggested I try `intel_iommu=igfx_off` rather than full `intel_iommu=off`. It's not enough; even with `intel_iommu=igfx_off`, resume is broken.

Same person on Arch bbs suggested my bug might be related to Bug 89360.

(In reply to Ronan Jouchet from comment #1)
> Same person on Arch bbs suggested my bug might be related to Bug 89360.

Whoopsie, not a bug on this bugzilla, bug on freedesktop's one. I meant https://bugs.freedesktop.org/show_bug.cgi?id=89360

One more piece of info: user Bake_Jailey, in the reddit thread "Linux 4.13.3-1 is in core!" [1], says:

> Same on my X1 Yoga rev 1. Needed the full intel_iommu=off,
> otherwise I was stuck suspended.

[1] https://www.reddit.com/r/archlinux/comments/72z2rv/linux_41331_is_in_core/

Problem persists with the newly-released 4.13.4-1 Arch package, https://www.archlinux.org/packages/testing/x86_64/linux/

iommu@lists.linux-foundation.org mailing-list discussion: https://lists.linuxfoundation.org/pipermail/iommu/2017-September/024382.html

News from the mailing-list discussion:

* Raj and Lu from Intel will be having a look. Raj says [1]:

> I suspect that after suspend, we do save some of the registers that might
> lose context. But the driver needs to reinitialize the uarch states
> again. for e.g. need to go through the set root table pointer commands
> again.
>
> Although i see some an attempt to save some of the context, but
> we aren't performing the ones like iommu_set_root_entry,
> enable_translation etc again to reinitialize those states.

* Rolling back to 4.12.13-1 and manually activating `intel_iommu=on`, resume is similarly broken. Thus, this bug has nothing specific to 4.13; `intel_iommu={on, igfx_off}` have probably been breaking resume for a long time on my system, I just happened to notice it now due to it being `on` by default.
* Also, not sure if this is related or a separate bug, but looking at my system logs I see two errors logged when `intel_iommu` is `on`:

DMAR: DRHD: handling fault status reg 3
DMAR: [DMA Read] Request device [00:12.4] fault addr b7fff000 [fault reason 02] Present bit in context entry is clear

Feel free to ask for more info about that DMAR error, or ignore it if you know it's irrelevant / already tracked elsewhere.

[1] https://lists.linuxfoundation.org/pipermail/iommu/2017-September/024383.html

And for the record, on this ThinkPad T450s, just starting X (startx) freezes the machine with 4.13.3. `intel_iommu=on,igfx_off` solves it. No resume from suspend involved.

(In reply to Daniel Lublin from comment #7)
> And for the record, on this ThinkPad T450s, just starting X (startx)
> freezes the machine with 4.13.3. `intel_iommu=on,igfx_off` solves it.
> No resume from suspend involved.

Looking at reports on r/Arch, there seem to be two common hard bugs with intel_iommu=on:
a. This bug: resume from suspend broken
b. X freezing the machine

Let's not mix them; this bug is about a, not b. I'm not affected by b and cannot point you to a bug, but I think I saw some kernel bug mentions in one of the two threads about 4.13 landing in [core].

Similar problem here on a Thinkpad Yoga 460. My hunch is that it's not resume that is broken; it seems more like suspend is broken for me. I say that because when I am using intel_iommu=off and close the laptop lid, the "laptop is on" light fades away and then lights up again at an interval of approximately one second. Then opening the lid will make it resume fine. But if I have intel_iommu=on and close the laptop lid, the "laptop is on" light fades away and then lights up much faster, maybe at a 200ms interval or something, so that makes me think that it's not resume that is failing but suspend itself.

I'm seeing this on my X1 Yoga (gen1) as well.
When going to suspend (via systemctl suspend) with the default (intel_iommu=on), the power light starts fading/"breathing", but the audio mute LED stays on and the machine hangs. With intel_iommu=off, the power light breathes as well and the audio mute LED turns off correctly. I can then resume it normally (by pressing the Fn key).

I can confirm the same issue on my Lenovo T460 running 4.13.4 (Arch). Not only did resume from suspend not work, but trying to restart or shutdown also freezes the system at the very end. intel_iommu=off solved both problems for me.

(In reply to Richard Berger from comment #11)
> I can confirm the same issue on my Lenovo T460 running 4.13.4 (Arch).
>
> Not only did resume from suspend not work, but trying to
> restart or shutdown also freezes the system at the very end.
>
> intel_iommu=off solved both problems for me.

By the way, a quick note to Arch passersby: the upcoming 4.13.5 kernel, currently in the [testing] repo, won't exhibit the problem, as intel_iommu is going to be disabled (as it was under 4.12). See [1], and grep for `iommu` in [2]. But (for now) it's not fixed.

[1] https://git.archlinux.org/svntogit/packages.git/log/trunk?h=packages%2Flinux
[2] https://git.archlinux.org/svntogit/packages.git/commit/trunk?h=packages%2Flinux&id=d8765edf559420ac826bd67491519a3bcf1beba9

This issue has been narrowed down to a hidden ME device which is not OS aware. The main symptom is below error log message and system fails to resume after being suspended.

DMAR: DRHD: handling fault status reg 3
DMAR: [DMA Read] Request device [00:12.4] fault addr b7fff000 [fault reason 02] Present bit in context entry is clear

A quick workaround is make PTP OS aware in BIOS configuration. It's likely at "PCH-FW Configuration"->"PTP aware OS".

(In reply to Lu Baolu from comment #13)
> This issue has been narrowed down to a hidden ME device which is not OS
> aware. The main symptom is below error log message and system fails to
> resume after being suspended.
>
> DMAR: DRHD: handling fault status reg 3
> DMAR: [DMA Read] Request device [00:12.4] fault addr b7fff000 [fault
> reason 02] Present bit in context entry is clear
>
> A quick workaround is make PTP OS aware in BIOS configuration. It's likely
> at "PCH-FW Configuration"->"PTP aware OS".

Hi Lu, thanks for the follow-up! Sadly, I couldn't find such an option in my T560's BIOS. Is a fix on the way?

(In reply to Ronan Jouchet from comment #14)
> (In reply to Lu Baolu from comment #13)
> > This issue has been narrowed down to a hidden ME device which is not OS
> > aware. The main symptom is below error log message and system fails to
> > resume after being suspended.
> >
> > DMAR: DRHD: handling fault status reg 3
> > DMAR: [DMA Read] Request device [00:12.4] fault addr b7fff000 [fault
> > reason 02] Present bit in context entry is clear
> >
> > A quick workaround is make PTP OS aware in BIOS configuration. It's likely
> > at "PCH-FW Configuration"->"PTP aware OS".
>
> Hi Lu, thanks for the follow-up!
>
> Sadly, I couldn't find such an option in my T560's BIOS.
> Is a fix on the way?

Might be different option in T560's BIOS. Anyway, I am looking for a fix. I will update you if I have anything.

The BIOS on ThinkPads seems to be extremely basic, and I could not find the option on the T460 either, but thanks a lot for your help.

(In reply to Lu Baolu from comment #15)
> This issue has been narrowed down to a hidden ME device which
> is not OS aware. The main symptom is below error log message
> and system fails to resume after being suspended.
>
> DMAR: DRHD: handling fault status reg 3
> DMAR: [DMA Read] Request device [00:12.4] fault addr b7fff000 [fault
> reason 02] Present bit in context entry is clear
>
> A quick workaround is make PTP OS aware in BIOS configuration.
> It's likely at "PCH-FW Configuration"->"PTP aware OS".
>
> [...]
>
> I am looking for a fix. I will update you if I have anything.

Hi Lu.
Was reading about newly-released kernel 4.14, and the corresponding iommu pull: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=4dfc2788033d30dfccfd4268e06dd73ce2c654ed

Our iommu bug isn't fixed in 4.14, right? Anything new? Good day.

This bug is still under investigation. We have narrowed it as a regression caused by a previous commit. The commit owner is now working on a fix.

I just found the "intel_iommu=off" workaround in my GRUB config and decided to take a quick look how things look nowadays. The workaround isn't needed anymore (at least on Archlinux), since it doesn't set "CONFIG_INTEL_IOMMU_DEFAULT_ON". However, when explicitly setting intel_iommu=on, things seem even worse with 4.20.6: Going to standby ends up in some limbo mode, where the laptop seems in standby, but fans continue to spin. Also, waking it up is still impossible.

(In reply to Lu Baolu from comment #18)
> This bug is still under investigation. We have narrowed it as a regression
> caused by a previous commit. The commit owner is now working on a fix.

Lu, any news from this? This bug started affecting me again in Arch kernel 5.5.1.arch1-1, which is back to setting `CONFIG_INTEL_IOMMU_DEFAULT_ON=y` - full config at https://git.archlinux.org/svntogit/packages.git/tree/trunk/config?h=packages/linux

I can provide updated debug information if what was provided in 2017 is not enough; please ask.

It seems to be caused by below commit:

commit 422eac3f7deae34dbaffd08e03e27f37a5394a56
Author: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
Date: Tue Apr 19 12:54:18 2016 +0300

    tpm_crb: fix mapping of the buffers

    On my Lenovo x250 the following situation occurs:

    [18697.813871] tpm_crb MSFT0101:00: can't request region for resource [mem 0xacdff080-0xacdfffff]

    The mapping of the control area overlaps the mapping of the command buffer. The control area is mapped over page, which is not right. It should mapped over sizeof(struct crb_control_area).

    Fixing this issue unmasks another issue. Command and response buffers can overlap and they do interleave on this machine. According to the PTP specification the overlapping means that they are mapped to the same buffer.

    The commit has been also on a Haswell NUC where things worked before applying this fix so that the both code paths for response buffer initialization are tested.

    Cc: stable@vger.kernel.org
    Fixes: 1bd047be37d9 ("tpm_crb: Use devm_ioremap_resource")
    Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
    Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>

(In reply to Lu Baolu from comment #21)
> It seems to be caused by below commit:
>
> commit 422eac3f7deae34dbaffd08e03e27f37a5394a56
> Author: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
> Date: Tue Apr 19 12:54:18 2016 +0300
>
> tpm_crb: fix mapping of the buffers
>
> On my Lenovo x250 the following situation occurs:
>
> [18697.813871] tpm_crb MSFT0101:00: can't request region for resource
> [mem 0xacdff080-0xacdfffff]
>
> The mapping of the control area overlaps the mapping of the command
> buffer. The control area is mapped over page, which is not right. It
> should mapped over sizeof(struct crb_control_area).
>
> Fixing this issue unmasks another issue. Command and response buffers
> can overlap and they do interleave on this machine. According to the PTP
> specification the overlapping means that they are mapped to the same
> buffer.
>
> The commit has been also on a Haswell NUC where things worked before
> applying this fix so that the both code paths for response buffer
> initialization are tested.
>
> Cc: stable@vger.kernel.org
> Fixes: 1bd047be37d9 ("tpm_crb: Use devm_ioremap_resource")
> Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
> Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>

Thanks for the fast feedback, Lu :) .
I'm only a kernel user, though, and I don't understand whether that means resume from suspend is supposed to be fixed now, or not. Is it expected that 5.5.1 is still broken? If yes, is a fix coming?

I have no idea about how this commit impacts the suspend/resume. The fast way to make it work is to revert this commit, or contact the commit author to rework it.

Okay, added Jarkko in Cc of this bug.

Hi Jarkko. Bringing Comment 20 and following to your attention, where Lu says you might have an idea about that bug. Can both of you discuss a fix? Feel free to ask for more debug information as necessary, as the initial report is from 2017 and might need an update.

(In reply to Lu Baolu from comment #23)
> I have no idea about how this commit impacts the suspend/resume. The fast
> way to make it work is to revert this commit, or contact the commit author
> to rework it.

I have this problem too.

Do you want me to confirm that reverting this commit fixes it or you are sure about it already?

(In reply to Albert Astals Cid from comment #25)
> (In reply to Lu Baolu from comment #23)
> > I have no idea about how this commit impacts the suspend/resume. The fast
> > way to make it work is to revert this commit, or contact the commit author
> > to rework it.
>
> I have this problem too.
>
> Do you want me to confirm that reverting this commit fixes it or you are
> sure about it already?

Yes please, we're not sure about it. Thanks!

Unfortunately that patch does not cleanly apply as a revert to the current kernel code, so I've no idea what to do. Can someone provide a patch they want me to try?

Lu, Jarkko, or anyone working at Intel on iommu: can one of you follow up on Comment 21, or on options to confirm that commit 422eac3f7deae34dbaffd08e03e27f37a5394a56 is the culprit? The above commenter attempted to test a revert of this commit, but reverting against the current kernel code failed, and we don't know what else to do. Is there anything else we can do to help?
Can you Cc other Intel folks potentially able to help? Thanks.

I have a Lenovo x201, installed Kali Linux, checked `dmesg` and saw the same output:

[drm:i915_gem_gtt_finish_pages [i915]] *ERROR* Failed to wait for idle; VT'd may hang.
[drm:i915_gem_gtt_finish_pages [i915]] *ERROR* Failed to wait for idle; VT'd may hang.

At first I added `intel_iommu=off` to the linux boot line in the `grub.cfg` file. That stopped the issue. I then changed the line a little, keeping iommu on but turning igfx off, like this: `intel_iommu=on,igfx_off`. That also worked.

But I wanted to narrow down the issue. Since it's got "VT'd" in the dmesg output, to check I deleted `intel_iommu=on,igfx_off` from the `grub.cfg` file, rebooted, and checked dmesg: the error was there again. So I went inside the x201 BIOS, under Config -> CPU -> Intel VT-d Feature, and disabled the setting, rebooted, checked `dmesg`, and the error was gone. If you don't need this Intel VT-d feature for virtual machines on your laptop, then disabling it is not a problem. So disabling Intel VT-d in the BIOS resolved the issue.

I believe I have the same issue with various kernels (5.4-5.8) on a Google Pixelbook. I applied the following diff, which as far as I can tell should be equivalent to reverting 422eac3f7deae34dbaffd08e03e27f37a5394a56. However, this did not solve the problem for me. So I don't think 422eac3f7dea is the culprit. It would probably still be useful for other people to test, though.
---
diff --git a/drivers/char/tpm/tpm_crb.c b/drivers/char/tpm/tpm_crb.c
index a9dcf31eadd21..3706c4250bd0f 100644
--- a/drivers/char/tpm/tpm_crb.c
+++ b/drivers/char/tpm/tpm_crb.c
@@ -548,7 +548,7 @@ static int crb_map_io(struct acpi_device *device, struct crb_priv *priv,
 	}
 
 	priv->regs_t = crb_map_res(dev, iores, iobase_ptr, buf->control_address,
-				   sizeof(struct crb_regs_tail));
+				   0x1000);
 	if (IS_ERR(priv->regs_t))
 		return PTR_ERR(priv->regs_t);
 
@@ -625,12 +625,12 @@ static int crb_map_io(struct acpi_device *device, struct crb_priv *priv,
 	if (iores)
 		rsp_size = crb_fixup_cmd_size(dev, iores, rsp_pa, rsp_size);
 
-	if (cmd_pa != rsp_pa) {
+	/* if (cmd_pa != rsp_pa) { */
 		priv->rsp = crb_map_res(dev, iores, iobase_ptr,
 					rsp_pa, rsp_size);
 		ret = PTR_ERR_OR_ZERO(priv->rsp);
 		goto out;
-	}
+	/* } */
 
 	/* According to the PTP specification, overlapping command and response
 	 * buffer sizes must be identical.

(In reply to Ronan from comment #28)
> Lu, Jarkko, or anyone working at Intel on iommu: can one of you followup on
> Comment 21, or options to confirm that commit
> 422eac3f7deae34dbaffd08e03e27f37a5394a56 is the culprit?
>
> The above commenter attempted to test a revert of this commit, but reverting
> against the current kernel code failed, and we don't know what else to do.
>
> Is there anything else we can do to help? Can you Cc other Intel folks
> potentially able to help? Thanks.

Please email to linux-integrity@vger.kernel.org with:

1. Cc to me.
2. Relevant information.

I don't think Bugzilla scales to progress further in discussion about root cause. Especially, this is an issue because my iommu knowledge is limited. VGER lists are the only guaranteed place to have a full developer reach.

Created attachment 292391 [details]
Email sent to linux-integrity@vger.kernel.org on 2020-09-06, including a digest of what we know and updated debug info as of kernel 5.8.7

(In reply to jarkko.sakkinen from comment #31)
> Please email to linux-integrity@vger.kernel.org with:
>
> 1. Cc to me.
> 2. Relevant information.
>
> I don't think Bugzilla scales to progress further in discussion about root
> cause. Especially, this is an issue because my iommu knowledge is limited.
> VGER lists are the only guaranteed place to have a full developer reach.

Jarkko, I'm emailing now the attached summary to linux-integrity@vger.kernel.org , CCed to you. I remain available to provide more debug info, but I'll appreciate your help pushing this forward; this is way out of my knowledge zone. Thanks for the follow-up, hope we sort this out someday!

Created attachment 300287 [details]
ThinkPad X1 Carbon 4th 5.16.1, intel_iommu=on , enable_gvt=1, and suspend working
After upgrading the firmware, upgrading to 5.16.1, and changing a BIOS setting, I got things working.
Changing the security chip from PTT to TPM fixed the tpm0 timeout and DMAR problems.
[ 0.242713] DMAR: DRHD: handling fault status reg 3
[ 0.242719] DMAR: [DMA Read NO_PASID] Request device [00:12.4] fault addr 0xaffff000 [fault reason 0x02] Present bit in context entry is clear
[ 0.688154] Freeing initrd memory: 39528K
[ 1.230534] tsc: Refined TSC clocksource calibration: 2808.006 MHz
[ 1.230559] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x2879cbf0535, max_idle_ns: 440795305184 ns
[ 1.230617] clocksource: Switched to clocksource tsc
[ 4.654935] tpm tpm0: Operation Timed out
[ 11.478928] tpm tpm0: Operation Timed out
[ 11.478950] tpm_crb: probe of MSFT0101:00 failed with error -62
Just for the sake of completeness, I also enabled Intel SGX and SMBIOS. I disabled legacy UEFI boot and Intel AMT control.
S3 sleep and wakeup are also working after this fix.
Someone was suggesting that the wakeup is not broken but rather the suspend itself.
I think I can confirm this. The mic-mute LED does not turn off when suspend fails.
When suspend is working, the LED turns off while suspended.
Intel_iommu and GVT-g graphics virtualization and suspend/wakeup .... working
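
For anyone comparing logs across kernel versions: the two DMAR fault formats quoted in this thread (the 4.13-era line, and the 5.16-era line with `NO_PASID` and `0x` prefixes) can be pulled apart with a small script. This is only a sketch. The regular expression is inferred from the log lines pasted above rather than taken from the kernel source, `parse_dmar_fault` is a hypothetical helper name, and treating un-prefixed reason codes as decimal is an assumption.

```python
import re

# Pattern inferred from the two fault lines quoted in this thread (an
# assumption, not derived from the kernel source):
#   DMAR: [DMA Read] Request device [00:12.4] fault addr b7fff000 [fault reason 02] ...
#   DMAR: [DMA Read NO_PASID] Request device [00:12.4] fault addr 0xaffff000 [fault reason 0x02] ...
FAULT_RE = re.compile(
    r"DMAR: \[(?P<kind>DMA (?:Read|Write))(?: [A-Z_]+)?\] "
    r"Request device \[(?P<device>[0-9a-fA-F:.]+)\] "
    r"fault addr (?:0x)?(?P<addr>[0-9a-fA-F]+) "
    r"\[fault reason (?P<reason>0x[0-9a-fA-F]+|\d+)\] "
    r"(?P<text>.*)"
)


def parse_dmar_fault(line):
    """Return a dict describing one DMAR fault line, or None if it does not match."""
    match = FAULT_RE.search(line)
    if match is None:
        return None
    reason = match.group("reason")
    return {
        "kind": match.group("kind"),
        "device": match.group("device"),
        # The fault address is hexadecimal with or without the 0x prefix.
        "addr": int(match.group("addr"), 16),
        # Assumption: 0x-prefixed reasons are hex, bare ones are decimal.
        "reason": int(reason, 16) if reason.startswith("0x") else int(reason, 10),
        "text": match.group("text").strip(),
    }
```

Feeding it the output of `dmesg` line by line and collecting the non-`None` results gives a quick way to see whether device `00:12.4` still faults after a BIOS or kernel change.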
FWIW I'm still seeing this in 6.1.7 from the Debian package in bookworm (this is the current "testing"). This is how this kernel is configured for iommu:

```
# CONFIG_INTEL_IOMMU_DEFAULT_ON is not set
CONFIG_INTEL_IOMMU_DEFAULT_ON_INTGPU_OFF=y
# CONFIG_INTEL_IOMMU_DEFAULT_OFF is not set
```

Adding iommu=off to my grub fixes S3 suspend like others have already reported here, but I don't understand really what the consequences of turning it off are. This is a Thinkpad X1 Carbon Gen 4.

Created attachment 305105 [details]
patch to remove memory allocation in iommu_suspend()
Please check if this patch solves the problem or not.
After a double check, it seems that this is a different problem from what the patch fixes. Please ignore it.

Created attachment 305107 [details]
patch to remove memory allocation in iommu_suspend()
Updated the patch changelog message, although the patch does not target the current problem.
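
Since most reports in this thread hinge on which `intel_iommu=` value was actually in effect (`on`, `off`, `on,igfx_off`, or none at all), here is a minimal sketch for checking it when gathering debug info. It only inspects the kernel command line, so it says nothing about the compiled-in default (`CONFIG_INTEL_IOMMU_DEFAULT_ON` and friends); `intel_iommu_options` and `read_cmdline` are hypothetical helper names.

```python
def intel_iommu_options(cmdline):
    """Return every comma-separated value passed via intel_iommu= on a command line."""
    prefix = "intel_iommu="
    options = []
    for token in cmdline.split():
        if token.startswith(prefix):
            # intel_iommu=on,igfx_off carries several options in one token.
            options.extend(opt for opt in token[len(prefix):].split(",") if opt)
    return options


def read_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with (Linux only)."""
    with open(path) as handle:
        return handle.read()
```

On a running system, `intel_iommu_options(read_cmdline())` returning an empty list means the parameter was not passed, and the kernel's built-in default applies.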