Bug 9544 - Kernel oops in swsusp_save on suspend (IOMMU-related)
Status: CLOSED CODE_FIX
Alias: None
Product: Power Management
Classification: Unclassified
Component: Hibernation/Suspend
Hardware: All Linux
Importance: P1 normal
Assignee: Rafael J. Wysocki
URL:
Keywords:
Depends on:
Blocks: 7216
Reported: 2007-12-11 05:30 UTC by Alexandre Julliard
Modified: 2009-04-02 04:55 UTC

See Also:
Kernel Version: 2.6.23.8
Subsystem:
Regression: No
Bisected commit-id:


Attachments
dmesg output (21.08 KB, text/plain)
2007-12-11 14:10 UTC, Alexandre Julliard

Description Alexandre Julliard 2007-12-11 05:30:33 UTC
Most recent kernel where this bug did not occur: n/a
Distribution: Debian sid
Hardware Environment: AMD Athlon 64 X2 dual core with 4 GB RAM, chipset nForce 500 SLI
Software Environment: Linux wine.dyndns.org 2.6.23.8-gb506e24f-dirty #9 SMP Tue Dec 11 11:00:48 CET 2007 x86_64 GNU/Linux

Problem Description:
I'm getting a crash in swsusp_save() on suspend, when it tries to access
address 0xffff810008000000 (sorry, I don't have the full oops; let me know if
you want me to copy it down by hand). This address is apparently the first page
in the GART IOMMU range:

PCI-DMA: Disabling AGP.
PCI-DMA: aperture base @ 8000000 size 65536 KB
PCI-DMA: using GART IOMMU.
PCI-DMA: Reserving 64MB of IOMMU area in the AGP aperture

My guess is that the IOMMU aperture range should somehow be skipped when
copying.

Suspending works fine if I boot with iommu=soft, or with mem=3G.
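
For reference, the faulting address is consistent with the aperture base seen through the kernel's direct mapping of physical memory. The sketch below only reproduces that arithmetic in user space. It assumes the 2.6.23-era x86_64 direct-map base of 0xffff810000000000; the macro and function names are made up for illustration and are not kernel identifiers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: base of the x86_64 direct mapping of physical memory
 * on 2.6.23-era kernels. */
#define DIRECT_MAP_BASE 0xffff810000000000ULL

/* From the PCI-DMA messages: aperture base @ 8000000, size 65536 KB. */
#define APERTURE_BASE 0x8000000ULL
#define APERTURE_SIZE (65536ULL * 1024ULL)

/* Hypothetical helper: direct-mapped virtual address of a physical address. */
static uint64_t phys_to_direct_virt(uint64_t phys)
{
    assert(phys <= UINT64_MAX - DIRECT_MAP_BASE); /* no wrap-around */
    return DIRECT_MAP_BASE + phys;
}

/* Hypothetical helper: does a physical address fall inside the aperture? */
static bool aperture_contains(uint64_t phys)
{
    return phys >= APERTURE_BASE && phys < APERTURE_BASE + APERTURE_SIZE;
}
```

With these values, phys_to_direct_virt(APERTURE_BASE) is 0xffff810008000000, which matches the address in the oops, and the aperture covers physical addresses 0x8000000 up to (but not including) 0xC000000.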
Comment 1 Andrew Morton 2007-12-11 12:14:00 UTC
So you haven't tested any kernel earlier than 2.6.23?

Yes, a copy of the oops would be great, please.  You shouldn't need to write
it down - try netconsole (Documentation/networking/netconsole.txt).

It's worth setting up netconsole...
Comment 2 Rafael J. Wysocki 2007-12-11 13:38:36 UTC
netconsole won't work at that point (devices suspended, interrupts disabled).  Serial console might be useful, though, but I doubt that box has a serial port.

I guess the problem is present in all kernels to date.

Alexandre, can you attach a dmesg output, please?
Comment 3 Alexandre Julliard 2007-12-11 14:10:13 UTC
Created attachment 13982
dmesg output

dmesg output attached.

I haven't tried other kernels, if that would be useful I could do it, any version you want me to try?

The box does have a serial port but I don't have anything to plug into it I'm afraid.
Comment 4 Ingo Molnar 2007-12-12 02:15:47 UTC
> netconsole won't work at that point (devices suspended, interrupts 
> disabled). Serial console might be useful, though, but I doubt that 
> box has a serial port.
> 
> I guess the problem is present in all kernels to date.

at least as long as netconsole output going _into_ suspend goes, i 
posted some really bad hacks to lkml some time ago that allow a 
per-device exclusion of the suspend sequence. (the suspend_disabled 
flag)

That way i was able to get a netconsole output far into the suspend, up 
to the point where we do the ACPI mmio command that physically suspends 
the CPU.

getting output from the system when it is coming out of resume is much 
harder. (but this crash is about going into the suspend, right?)
Comment 5 Rafael J. Wysocki 2007-12-12 15:30:32 UTC
(In reply to comment #4)
> 
> getting output from the system when it is coming out of resume is much 
> harder. (but this crash is about going into the suspend, right?)

Yes, but it happens in the middle of the "critical section" in which everything is supposed to be off, except for the CPU executing the code.  IOW, it's very much like a resume failure ...
Comment 6 Rafael J. Wysocki 2007-12-12 15:38:57 UTC
(In reply to comment #3)
> Created an attachment (id=13982)
> dmesg output
> 
> dmesg output attached.

Thanks.

> I haven't tried other kernels, if that would be useful I could do it, any
> version you want me to try?

Hm, it looks like this problem has always been present ...

> The box does have a serial port but I don't have anything to plug into it I'm
> afraid.

I see.

Well, it seems that the IOMMU driver should mark the aperture as "nosave" for us (it overlaps with a memory area that the image-creating code considers usable).

Did you try to enable the IOMMU option in the BIOS setup, BTW?
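
The "nosave" idea above can be illustrated with a small user-space model: a driver records a physical range, and the image-creating code skips every page frame inside a recorded range. This is only a sketch under stated assumptions. The table, mark_nosave() and page_is_saveable() are hypothetical stand-ins, not the kernel's actual hibernation interface (which has its own register_nosave_region() facility), and 4 KB pages are assumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12            /* assumption: 4 KB pages */
#define MAX_NOSAVE_REGIONS 8     /* arbitrary limit for this model */

/* Page-frame range [start_pfn, end_pfn) that must not be saved. */
struct nosave_region {
    uint64_t start_pfn;
    uint64_t end_pfn;
};

static struct nosave_region nosave_regions[MAX_NOSAVE_REGIONS];
static size_t nosave_count;

/* Hypothetical: record the physical range [start, end) as "nosave".
 * Returns false if the range is empty or the table is full. */
static bool mark_nosave(uint64_t start, uint64_t end)
{
    if (end <= start || nosave_count == MAX_NOSAVE_REGIONS)
        return false;
    nosave_regions[nosave_count].start_pfn = start >> PAGE_SHIFT;
    /* Round the end up so a partially covered last page is excluded too. */
    nosave_regions[nosave_count].end_pfn =
        (end + (1ULL << PAGE_SHIFT) - 1) >> PAGE_SHIFT;
    nosave_count++;
    return true;
}

/* Hypothetical: would the image-creating code be allowed to copy this page? */
static bool page_is_saveable(uint64_t pfn)
{
    for (size_t i = 0; i < nosave_count; i++) {
        if (pfn >= nosave_regions[i].start_pfn &&
            pfn < nosave_regions[i].end_pfn)
            return false;
    }
    return true;
}
```

For the aperture reported in this bug, a driver would call something like mark_nosave(0x8000000, 0xC000000) during setup; page frames 0x8000 through 0xBFFF would then be skipped while the snapshot is copied.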
Comment 7 Alexandre Julliard 2007-12-13 01:41:16 UTC
(In reply to comment #6)

> Did you try to enable the IOMMU option in the BIOS setup, BTW?

There doesn't seem to be any way to configure IOMMU in my BIOS setup, or if there is one I couldn't find it... It's an ASUS M2N-E SLI, chipset nForce 500.
Comment 8 Rafael J. Wysocki 2007-12-13 07:28:32 UTC
OK, thanks.  Probably Asus doesn't think you'd need that.

Unfortunately, I'm not familiar with the IOMMU handling code, so I'm afraid it'll take some time to come up with a fix ...
Comment 9 Zhang Rui 2008-11-19 23:21:54 UTC
hmmm, what if you boot with iommu=off?
Comment 10 Alexandre Julliard 2008-11-22 06:57:58 UTC
Actually I retested and the bug is fixed now, most likely by commit 2050d45d7c32cbad7a070d04256237144a0920db.
Comment 11 Len Brown 2009-04-02 04:55:39 UTC
commit 2050d45d7c32cbad7a070d04256237144a0920db
Author: Pavel Machek <pavel@ucw.cz>
Date:   Thu Mar 13 23:05:41 2008 +0100

    x86: fix long standing bug with usb after hibernation with 4GB ram

shipped in 2.6.25-rc7

closed
