Bug 42716

Summary: Boot failure with KMS enabled (radeon)
Product: Drivers
Reporter: Robby Workman (rw)
Component: Video(DRI - non Intel)
Assignee: drivers_video-dri
Status: RESOLVED INVALID
Severity: normal
CC: paulepanter
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 3.2.x
Subsystem:
Regression: No
Bisected commit-id:

Description Robby Workman 2012-02-02 03:11:21 UTC
Since 3.2.0, my machine [1] fails to boot - it crashes/panics in early boot when the radeon module (or one of its deps) is loaded. Graphics chip is as follows:

  VGA compatible controller: ATI Technologies Inc RS690 [Radeon X1200 Series]

This is all running on libdrm-2.4.29, mesa-7.11.2, xorg-server-1.11.4, and xf86-video-ati-6.14.3 with udev-180 (though the problem existed on 178 and 179 as well).

3.2.0, 3.2.1, and now 3.2.2 all exhibit this problem, but 3.1.x (including 3.1.10, which I'm currently using) is fine.  If I disable KMS (radeon.modeset=0), then the system boots fine.

I thought I'd be clever and bisect this, so I built about twelve or thirteen kernels, and bisect blamed this commit: 
  [50b8d257486a45cba7b65ca978986ed216bbcc10] ptrace: partially fix the do_wait(WEXITED) vs EXIT_DEAD->EXIT_ZOMBIE race

That doesn't seem reasonable, but sure, it's worth a try.  I reversed that patch in mainline 3.2.2, and I get a good boot.  However, I notice that this is in /var/log/syslog:
  Feb  1 21:00:40 isotope kernel: [    2.039197] [drm:r100_ring_test] *ERROR* radeon: ring test failed (scratch(0x15E4)=0xCAFEDEAD)
  Feb  1 21:00:40 isotope kernel: [    2.039256] [drm:r100_cp_init] *ERROR* radeon: cp isn't working (-22).
  Feb  1 21:00:40 isotope kernel: [    2.162916] [drm:r100_cp_fini] *ERROR* Wait for CP idle timeout, shutting down CP.

I then booted the vanilla 3.2.2, and it booted just fine.  Okay, that's odd, so let's try the patched (reversed bisect patch from above) again.  This time, it appeared to OOPS after switching over to KMS, but the display was still visible.  All of the potentially useful stuff scrolled off the screen, and now there's a constant repetition of the following two lines (with alternating values of CONNECTOR between 17 and 18) flooding the console every few seconds, so I can't page up to the OOPS/trace/whatever:

  [drm:radeon_atombios_connected_scratch_regs], DFP3 disconnected
  [drm:output_poll_execute], [CONNECTOR:18:DVI-D-1] status updated from 2 to 2


So, let's reboot to that kernel a few times.
#1: all is fine
#2: crash/panic after kms load; no video at all
#3: crash/panic after kms load; no video at all

This isn't going well, so let's boot the patched kernel again:
#1: all is fine
#2: all is fine
#3: crash/panic after kms load; no video at all

Any idea what's going on here?  I don't have the option to attach a serial console, for what that's worth, so that aside, any suggestions for further debugging?
Comment 1 Michel Dänzer 2012-02-02 11:59:30 UTC
(In reply to comment #0)
> Any idea what's going on here?

Given your mixed testing results, I suspect the bisect result is bogus; you probably need to at least test each kernel a couple of times before declaring it as good.
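
As a rough illustration of why a single clean boot proves little, here is a minimal sketch of the arithmetic. It assumes that boots are independent and that a bad kernel crashes with a fixed probability per boot; neither is guaranteed, especially with flaky hardware. The function names and the 50% crash rate are hypothetical, chosen only for illustration.

```python
import math


def chance_all_boots_pass(crash_rate, boots):
    """Probability that a bad kernel survives `boots` boots in a row."""
    return (1.0 - crash_rate) ** boots


def boots_needed(crash_rate, confidence=0.95):
    """Smallest number of clean boots before a kernel is called good.

    Assumes each boot of a bad kernel crashes independently with
    probability `crash_rate` (an assumption, not a measured value).
    """
    if not 0.0 < crash_rate < 1.0:
        raise ValueError("crash_rate must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - crash_rate))


if __name__ == "__main__":
    # Hypothetical crash rate of 50% per boot.
    print(chance_all_boots_pass(0.5, 1))
    print(boots_needed(0.5))
```

With an intermittent failure, a bisect step that marks a kernel good after one boot has a sizeable chance of being wrong, and one wrong step is enough to send the bisect to an unrelated commit.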


> I don't have the option to attach a serial console, for what that's
> worth, so that aside, any suggestions for further debugging?

You could try netconsole.
Comment 2 Robby Workman 2012-02-27 04:05:44 UTC
I'm marking this as invalid.  The machine I was seeing this on has died, and it started with all of the usb ports going bad.  From there, other parts stopped working over the course of several days.  I suspect a bad motherboard that was introducing faults, and 3.1.x was somehow working around them better than 3.2.x.  Who knows?  :/
Comment 3 Paul Menzel 2012-02-27 09:40:14 UTC
(In reply to comment #0)

[…]

> Any idea what's going on here?  I don't have the option to attach a serial
> console, for what that's worth, so that aside, any suggestions for further
> debugging?

Next time you can try netconsole [1], although it has to be loaded before the radeon module. I guess one way around that issue is to build the driver for the network card into the Linux kernel.

Additionally, you could have started with a standard driver like vesafb(?) and then loaded the radeon module manually after the system had booted.

Thanks and good luck with your new system.

[1] http://www.kernel.org/doc/Documentation/networking/netconsole.txt
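
For the receiving side, a minimal sketch follows. netconsole sends kernel log lines as plain UDP datagrams, by default to port 6666 on the target host, so any UDP listener on a second machine can capture them. The documentation in [1] describes the sender configuration as `netconsole=[src-port]@[src-ip]/[<dev>],[tgt-port]@<tgt-ip>/[tgt-macaddr]`. The function names and the log file name below are made up for illustration, and the script is an untested assumption about a typical setup, not something used in this bug.

```python
import socket

# Default netconsole target port; change it if the sender uses another one.
NETCONSOLE_PORT = 6666


def open_listener(host="0.0.0.0", port=NETCONSOLE_PORT):
    """Bind a UDP socket on which netconsole datagrams can be received."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def receive_lines(sock, count, timeout=None):
    """Return up to `count` datagrams as text, stopping early on timeout."""
    sock.settimeout(timeout)
    lines = []
    while len(lines) < count:
        try:
            data, _sender = sock.recvfrom(65535)
        except socket.timeout:
            break
        # Kernel messages may contain stray bytes; never fail on decoding.
        lines.append(data.decode("utf-8", errors="replace"))
    return lines


if __name__ == "__main__":
    listener = open_listener()
    # "netconsole.log" is a hypothetical file name.
    with open("netconsole.log", "a", encoding="utf-8") as log:
        while True:
            for text in receive_lines(listener, 1):
                print(text, end="")
                log.write(text)
                log.flush()  # keep the log current in case of a crash
```

Writing each message to disk as it arrives means the OOPS text survives even when the console on the crashing machine scrolls or goes blank.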
Comment 4 Paul Menzel 2012-02-27 09:41:59 UTC
(In reply to comment #2)
> I'm marking this as invalid.  The machine I was seeing this on has died, and
> it started with all of the usb ports going bad.  From there, other parts
> stopped working over the course of several days.  I suspect a bad motherboard
> that was introducing faults, and 3.1.x was somehow working around them better
> than 3.2.x.  Who knows?  :/

Could you please document what machine and what board this was? If others experience problems with the same machine, it is quite helpful to know about quality issues.