Bug 17692

Summary: Intermittent failure to resume from suspend-to-ram in radeon_bo_get_surface_reg with RV515 and ColorTiling
Product: Drivers
Reporter: Karl Tomlinson (bugs+kernel)
Component: Video(DRI - non Intel)
Assignee: drivers_video-dri
Status: RESOLVED OBSOLETE
Severity: normal
CC: alan, alexdeucher
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.35.4
Subsystem:
Regression: No
Bisected commit-id:

Description Karl Tomlinson 2010-09-02 23:46:10 UTC
TRACE_RESUMEs around this line indicate that it gets as far as radeon_resume,
but not as far as radeon_pm_resume:

http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.35.y.git;a=blob;f=drivers/gpu/drm/radeon/radeon_device.c;h=a7184636dcb494b36d470af30c2b513382d38019;hb=1506707a6c740db316e422239a53ae5df1727591#l798

Adding additional TRACE_RESUMEs earlier in radeon_resume_kms (and using them)
seems to make failure less frequent, as if there is a timing issue involved.

Possibly interesting messages on startup:

[drm] initializing kernel modesetting (RV515 0x1002:0x7145).
ATOM BIOS: M54CSP/M52CSP
[drm] Loading R500 Microcode
[drm] Initialized radeon 2.5.0 20080528 for 0000:01:00.0 on minor 0

and during successful resume:

radeon 0000:01:00.0: power state changed by ACPI to D0
radeon 0000:01:00.0: power state changed by ACPI to D0
radeon 0000:01:00.0: setting latency timer to 64
radeon 0000:01:00.0: ffff8800bfba5e00 unpin not necessary
[drm] radeon: 1 quad pipes, 1 z pipes initialized.
[drm] PCIE GART of 512M enabled (table at 0x00040000).
[drm] radeon: ring at 0x0000000008000000
[drm] ring test succeeded in 2 usecs
[drm] ib test succeeded in 0 usecs

ata1: EH complete
[drm:atom_op_jump] *ERROR* atombios stuck in loop for more than 1sec aborting
[drm:atom_execute_table_locked] *ERROR* atombios stuck executing EC38 (len 86,
WS 4, PS 0) @ 0xEC6B
PM: resume of devices complete after 1660.086 msecs

http://bugs.freedesktop.org/show_bug.cgi?id=27744

I have radeon.dynclks=1 on the command line.
I had intermittent resume failures even without this, though I haven't
confirmed that the failure is at the same point.
Comment 1 Karl Tomlinson 2010-09-02 23:46:44 UTC
01:00.0 VGA compatible controller: ATI Technologies Inc Radeon Mobility X1400 (prog-if 00 [VGA controller])
        Subsystem: Lenovo Thinkpad T60 model 2007
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 45
        Region 0: Memory at d8000000 (32-bit, prefetchable) [size=128M]
        Region 1: I/O ports at 2000 [size=256]
        Region 2: Memory at ee100000 (32-bit, non-prefetchable) [size=64K]
        [virtual] Expansion ROM at ee120000 [disabled] [size=128K]
        Capabilities: [50] Power Management version 2
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [58] Express (v1) Legacy Endpoint, MSI 00
                DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE- FLReset-
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                        RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 128 bytes, MaxReadReq 128 bytes
                DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
                LnkCap: Port #0, Speed 2.5GT/s, Width x16, ASPM L0s L1, Latency L0 <64ns, L1 <1us
                        ClockPM- Surprise- LLActRep- BwNot-
                LnkCtl: ASPM L0s L1 Enabled; RCB 64 bytes Disabled- Retrain- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        Capabilities: [80] MSI: Enable+ Count=1/1 Maskable- 64bit+
                Address: 00000000fee0100c  Data: 4189
        Kernel driver in use: radeon
Comment 3 Karl Tomlinson 2010-09-05 21:31:09 UTC
With that patch, I'm still seeing the same "stuck in loop for more than 5sec" messages on successful resume, with the same instruction and address.
I've only resumed once (successfully), so don't know yet whether failure is less frequent.
Comment 4 Karl Tomlinson 2010-09-10 05:56:25 UTC
The patch in comment 2 hasn't helped.

More TRACE_RESUMEs indicate that resume gets as far as
radeon_surface_init() from rv515_resume().
Comment 5 Karl Tomlinson 2010-09-19 23:50:03 UTC
The first radeon_bo_get_surface_reg fails to complete.  i = 0 at

http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.35.y.git;a=blob;f=drivers/gpu/drm/radeon/radeon_device.c;h=a7184636dcb494b36d470af30c2b513382d38019;hb=1506707a6c740db316e422239a53ae5df1727591#l98

I'll try disabling ColorTiling in the server.

I had a week of successful resumes while I had no dock or external monitor,
so I wonder whether this might be related, though it's difficult to verify given the intermittent nature.  I thought I once had a similar resume failure without dock or external monitor, but that would have been after suspending with dock and monitor attached.
Comment 6 Karl Tomlinson 2010-10-07 22:19:50 UTC
Yes, seems that switching off ColorTiling works around the issue.
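
For reference, a minimal sketch of what the workaround might look like, assuming the xf86-video-ati (radeon) X driver configured through an xorg.conf Device section. The Identifier "Card0" is only a placeholder and does not come from this report:

```
Section "Device"
    Identifier "Card0"                 # placeholder name, an assumption
    Driver     "radeon"
    Option     "ColorTiling" "off"     # turn off tiled color buffers
EndSection
```

The X server has to be restarted for the option to take effect.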
Comment 7 Karl Tomlinson 2010-10-07 22:21:46 UTC
FWIW, I'm still seeing the same "stuck in loop for more than 5sec" messages, so that seems unrelated.
Comment 8 Alan 2012-08-13 16:24:49 UTC
If this is still seen with a modern kernel, please update/re-open. Thanks.