Bug 38352 - 3.0-0.rc4.git3.1.fc16.x86_64 radeon GPU lockup then crash
Status: CLOSED INVALID
Product: Drivers
Classification: Unclassified
Component: Video(DRI - non Intel)
Hardware: All Linux
Importance: P1 normal
Assigned To: drivers_video-dri
Depends on:
Blocks: 36912
 
Reported: 2011-06-27 21:11 UTC by Nicolas Mailhot
Modified: 2011-07-10 09:40 UTC

See Also:
Kernel Version: 3.0-rc4
Tree: Fedora
Regression: Yes


Attachments
xorg logs (56.23 KB, text/plain)
2011-06-27 21:11 UTC, Nicolas Mailhot
system logs (302.07 KB, application/zip)
2011-06-27 21:22 UTC, Nicolas Mailhot

Description Nicolas Mailhot 2011-06-27 21:11:54 UTC
Created attachment 63572
xorg logs

Something in last weekend's rawhide made this system's HD 4850 really unhappy.

It seems the default mode is now broken:

1. in bios/grub: either the screen refuses the mode or, if it accepts it, it
shows screens with various ASCII characters moving around in white/green

2. in the kernel framebuffer: the mode change works, but green (with magenta
tinges) horizontal dot traces move in a blocky pattern around the screen

3. in xorg: trying to launch xorg (either via gdm or startx) locks up the
system

Attaching various system logs, rescued via ssh from a windows box (so don't
expect any filtering, raw logs only)

I haven't had the free time to update daily lately, and each crash costs hours
of md resync, so the breakage may have been introduced in earlier versions (or
maybe it was always there and rawhide gnome-shell just pokes where it should
not)

System is up-to-date as of now, koji builds included (everything that yum
agreed to update)

There has been a cold spell here, so temperatures were below average when the
system broke and it was not overheating (sensors report 75°C for the GPU, well
below what it can take). memtest86 passes with no errors (single pass
yesterday)

The 65.3 kHz EDID mode looks suspicious (the resolution is right)

Jun 27 22:19:32 arekh kernel: [  130.771039] CP stall for more than 10020msec
Jun 27 22:19:32 arekh kernel: [  130.771043] ------------[ cut here
]------------
Jun 27 22:19:32 arekh kernel: [  130.771067] WARNING: at
drivers/gpu/drm/radeon/radeon_fence.c:267 radeon_fence_wait+0x296/0x357
[radeon]()
Jun 27 22:19:32 arekh kernel: [  130.771070] Hardware name: EP45-DS5
Jun 27 22:19:32 arekh kernel: [  130.771073] GPU lockup (waiting for 0x00000006
last fence id 0x00000003)
Jun 27 22:19:32 arekh kernel: [  130.771075] Modules linked in: fuse ppdev
parport_pc lp parport it87 hwmon_vid coretemp ip6t_REJECT nf_conntrack_ipv6
ip6table_filter xt_state ipt_MASQUERADE ipt_REDIRECT xt_owner iptable_nat
nf_nat nf_conntrack_ipv4 nf_conntrack xt_TPROXY nf_tproxy_core xt_socket
nf_defrag_ipv4 ip6_tables nf_defrag_ipv6 xt_mark iptable_mangle
cpufreq_ondemand acpi_cpufreq freq_table mperf snd_emu10k1_synth snd_emux_synth
snd_seq_virmidi snd_seq_midi_event snd_seq_midi_emul tuner_simple tuner_types
wm8775 tda9887 tda8290 tuner cx25840 joydev snd_emu10k1 microcode
snd_hda_codec_hdmi ivtv cx2341x i2c_i801 snd_rawmidi snd_hda_intel
snd_hda_codec snd_ac97_codec v4l2_common serio_raw videodev pcspkr media
snd_seq v4l2_compat_ioctl32 tveeprom ac97_bus iTCO_wdt iTCO_vendor_support
snd_seq_device snd_util_mem snd_hwdep snd_pcm snd_timer snd soundcore
snd_page_alloc r8169 mii uinput raid1 firewire_ohci pata_acpi firewire_core
ata_generic crc_itu_t pata_jmicron radeon ttm drm_kms_helper drm i2c_algo_bit i
Jun 27 22:19:32 arekh kernel: 2c_core [last unloaded: scsi_wait_scan]
Jun 27 22:19:32 arekh kernel: [  130.771168] Pid: 1698, comm: X Not tainted
3.0-0.rc4.git3.1.fc16.x86_64 #1
Jun 27 22:19:32 arekh kernel: [  130.771170] Call Trace:
Jun 27 22:19:32 arekh kernel: [  130.771178]  [<ffffffff81057b14>]
warn_slowpath_common+0x83/0x9b
Jun 27 22:19:32 arekh kernel: [  130.771182]  [<ffffffff81057bcf>]
warn_slowpath_fmt+0x46/0x48
Jun 27 22:19:32 arekh kernel: [  130.771210]  [<ffffffffa00b6ff3>] ?
r600_gpu_is_lockup+0xbd/0xc6 [radeon]
Jun 27 22:19:32 arekh kernel: [  130.771230]  [<ffffffffa0090f45>]
radeon_fence_wait+0x296/0x357 [radeon]
Jun 27 22:19:32 arekh kernel: [  130.771235]  [<ffffffff81074d54>] ?
__init_waitqueue_head+0x4b/0x4b
Jun 27 22:19:32 arekh kernel: [  130.771258]  [<ffffffffa009155a>]
radeon_sync_obj_wait+0x11/0x13 [radeon]
Jun 27 22:19:32 arekh kernel: [  130.771267]  [<ffffffffa004b6a0>]
ttm_bo_wait+0xbd/0x179 [ttm]
Jun 27 22:19:32 arekh kernel: [  130.771292]  [<ffffffffa00a1ef1>]
radeon_bo_wait+0x7b/0xa4 [radeon]
Jun 27 22:19:32 arekh kernel: [  130.771317]  [<ffffffffa00a245a>]
radeon_gem_wait_idle_ioctl+0x3d/0x70 [radeon]
Jun 27 22:19:32 arekh kernel: [  130.771331]  [<ffffffffa001a8e2>]
drm_ioctl+0x2a4/0x386 [drm]
Jun 27 22:19:32 arekh kernel: [  130.771355]  [<ffffffffa00a241d>] ?
radeon_gem_busy_ioctl+0x86/0x86 [radeon]
Jun 27 22:19:32 arekh kernel: [  130.771361]  [<ffffffff812108ff>] ?
inode_has_perm+0x6a/0x77
Jun 27 22:19:32 arekh kernel: [  130.771365]  [<ffffffff812109b3>] ?
file_has_perm+0xa7/0xc9
Jun 27 22:19:32 arekh kernel: [  130.771370]  [<ffffffff811463dc>]
do_vfs_ioctl+0x47b/0x4bc
Jun 27 22:19:32 arekh kernel: [  130.771374]  [<ffffffff81146473>]
sys_ioctl+0x56/0x7b
Jun 27 22:19:32 arekh kernel: [  130.771379]  [<ffffffff814f9e82>]
system_call_fastpath+0x16/0x1b
Jun 27 22:19:32 arekh kernel: [  130.771383] ---[ end trace b32a929e3e5ce906
]---
Jun 27 22:19:32 arekh kernel: [  130.787534] radeon 0000:01:00.0: GPU softreset
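
When triaging logs like the one above, it can help to pull out just the fence ids from the "GPU lockup" warning. The following is a minimal sketch, not part of the original report: the function name `find_lockups` is hypothetical, and the pattern only assumes the message text exactly as it appears in this log (including the line wrap in the middle of the message).

```python
import re

# Matches the radeon warning as printed in the log above. \s+ tolerates the
# hard line wrap between "waiting for 0x..." and "last fence id 0x...".
LOCKUP_RE = re.compile(
    r"GPU lockup \(waiting for (0x[0-9a-fA-F]+)\s+last fence id (0x[0-9a-fA-F]+)\)"
)


def find_lockups(log_text):
    """Return a list of (waiting_for, last_fence_id) pairs as integers."""
    return [(int(w, 16), int(l, 16)) for w, l in LOCKUP_RE.findall(log_text)]


# Sample copied from the log in this report.
sample = (
    "Jun 27 22:19:32 arekh kernel: [  130.771073] GPU lockup (waiting for 0x00000006\n"
    "last fence id 0x00000003)\n"
)

for waiting, last in find_lockups(sample):
    print(f"waiting for fence {waiting:#x}, last fence id {last:#x}")
```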
Comment 1 Nicolas Mailhot 2011-06-27 21:22:46 UTC
Created attachment 63592
system logs
Comment 2 Andrew Morton 2011-06-28 22:49:46 UTC
Recategorised as DRI, marked as a regression.
Comment 3 Alex Deucher 2011-06-28 23:33:33 UTC
It sounds like your card may be going bad if the bios/grub screen is messed up.  That happens before the driver even loads.  What happens if you use a vga console?  Boot up to a non-X runlevel and add radeon.modeset=0 to your kernel command line in grub.  Is it just new kernels or does the new behavior manifest on older kernels as well?  Make sure your GPU fan is functional and clear of dust.
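
To confirm that the running kernel was really booted with the suggested option, one can inspect the kernel command line. This is a small sketch under stated assumptions: the helper names `has_kernel_param` and `read_cmdline` are hypothetical, the example command line is made up for illustration, and `/proc/cmdline` is the standard Linux location of the boot command line.

```python
import os


def has_kernel_param(cmdline, name, value=None):
    """Return True if `name` (optionally with `value`) is on the command line."""
    for token in cmdline.split():
        key, _, val = token.partition("=")
        if key == name and (value is None or val == value):
            return True
    return False


def read_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with (Linux only)."""
    with open(path) as f:
        return f.read()


# Hypothetical command line, for illustration only.
example = "ro root=/dev/md0 radeon.modeset=0 3"
print(has_kernel_param(example, "radeon.modeset", "0"))

if __name__ == "__main__" and os.path.exists("/proc/cmdline"):
    # On a live system, check the real boot parameters.
    print(has_kernel_param(read_cmdline(), "radeon.modeset", "0"))
```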
Comment 4 Nicolas Mailhot 2011-07-09 12:35:07 UTC
So I ordered a new gpu card and it seems you were right, the old card was bad (the gpu itself was fine and properly cooled, but a heatsink had fallen off one gddr package; this didn't appear in temp readings)

I don't think this should be enough to crash the kernel, but I suppose there are more fruitful priorities right now

Thank you for the advice
