Bug 73931 - rmmod radeon and kernel crash
Summary: rmmod radeon and kernel crash
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(DRI - non Intel)
Hardware: x86-64 Linux
Importance: P1 high
Assignee: drivers_video-dri
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-04-13 06:48 UTC by Pali Rohár
Modified: 2018-06-08 10:41 UTC
CC List: 4 users

See Also:
Kernel Version: 3.14
Subsystem:
Regression: No
Bisected commit-id:


Attachments
rmmod_radeon_kernel_panic (12.65 KB, text/plain)
2014-04-13 06:48 UTC, Pali Rohár
possible fix (1.74 KB, patch)
2014-04-14 14:36 UTC, Alex Deucher
syslog output after modprobe radeon (13.94 KB, text/plain)
2014-04-14 16:11 UTC, Pali Rohár
possible fix (2.28 KB, patch)
2014-04-14 16:53 UTC, Alex Deucher
pstore log (52.01 KB, text/plain)
2014-04-14 19:06 UTC, Pali Rohár
possible fix v2 (4.52 KB, patch)
2014-04-14 20:53 UTC, Alex Deucher
pstore log (27.97 KB, text/plain)
2014-04-14 21:25 UTC, Pali Rohár
dmesg plymouth log (10.07 KB, text/plain)
2014-04-14 21:44 UTC, Pali Rohár

Description Pali Rohár 2014-04-13 06:48:14 UTC
Created attachment 132051 [details]
rmmod_radeon_kernel_panic

After calling rmmod radeon, the kernel prints many error lines and then crashes.

It looks like the radeon module does not clean up its hwmon interface on exit. After rmmod radeon, the hwmon interface is still present:

$ readlink /sys/class/hwmon/hwmon1
../../devices/pci0000:00/0000:00:01.0/0000:01:00.0/hwmon/hwmon1

Running ls or cat inside hwmon1 then crashes the kernel.

See the attached syslog output.
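
A leftover hwmon entry like the one above can be spotted without touching anything inside the hwmon directory. The following is only a sketch: the function name list_hwmon_owners is made up for illustration, and it assumes each class entry is a symlink of the form <parent device>/hwmon/hwmonN, as in the readlink output above. It inspects symlinks only and never opens a hwmon attribute, since that is what triggers the crash. Virtual hwmon devices have no parent driver and will also be flagged, so a "NO DRIVER" line is a hint, not proof.

```shell
# Report, for each hwmon class entry, whether its parent device still
# has a driver bound. Only symlinks are examined.
list_hwmon_owners() {
    class_dir="${1:-/sys/class/hwmon}"
    for entry in "$class_dir"/hwmon*; do
        # Skip anything that is not a class symlink (or an unmatched glob).
        [ -L "$entry" ] || continue
        target=$(readlink "$entry")
        # Assumption: strip the trailing /hwmon/hwmonN to get the parent device.
        parent=$(dirname "$(dirname "$target")")
        if [ -L "$class_dir/$parent/driver" ]; then
            state="bound"
        else
            state="NO DRIVER (possibly stale)"
        fi
        printf '%s -> %s: %s\n' "$(basename "$entry")" "$parent" "$state"
    done
}
```

Called with no argument, list_hwmon_owners looks at /sys/class/hwmon; a different directory can be passed for testing.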

$ lspci -nn
01:00.0 Display controller [0380]: Advanced Micro Devices, Inc. [AMD/ATI] Sun XT [Radeon HD 8670A/8670M/8690M] [1002:6660]
Comment 1 Alex Deucher 2014-04-14 14:36:33 UTC
Created attachment 132201 [details]
possible fix

Does the attached patch help?
Comment 2 Pali Rohár 2014-04-14 16:11:46 UTC
Created attachment 132221 [details]
syslog output after modprobe radeon

Yes, your patch fixes the original problem. It may be a candidate for the stable releases. I tested the patch on 3.14: the system works fine after rmmod radeon, and there is no crash after calling: find /sys

But now there is a new kernel crash. When I modprobe the radeon module again (after the previous successful rmmod), the kernel crashes. See the syslog output in the attachment.
Comment 3 Alex Deucher 2014-04-14 16:53:13 UTC
Created attachment 132231 [details]
possible fix

Does this help in the second case?
Comment 4 Pali Rohár 2014-04-14 18:27:25 UTC
No, it does not help; the kernel still crashes. This time I cannot provide syslog output, because the userspace rsyslog daemon no longer reads the kernel log and writes it to disk. The output on the framebuffer screen is also overwritten too quickly for me to capture it.
Comment 5 Pali Rohár 2014-04-14 19:06:25 UTC
Created attachment 132251 [details]
pstore log

I have now found pstore and its EFI backend.

I modprobed efi-pstore before rmmoding radeon, and the dmesg logs were stored in EFI after the kernel crash. I believe there is something useful in them for you.

Attachment generated by:
$ cd /sys/fs/pstore/; cat `ls -r *1; ls -r *2`
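
For anyone repeating this capture, the records can be copied off pstore before the next test run. This is a rough sketch, not part of the procedure used above: the function name save_pstore and the default output directory are hypothetical, and it assumes pstore is mounted at /sys/fs/pstore as in the command above. It copies the records unchanged and does not try to put them in chronological order.

```shell
# Copy every record from a pstore mount into a directory on disk.
save_pstore() {
    src="${1:-/sys/fs/pstore}"
    dest="${2:-./pstore-dump}"
    mkdir -p "$dest" || return 1
    count=0
    for rec in "$src"/*; do
        # Skip an unmatched glob and anything that is not a regular file.
        [ -f "$rec" ] || continue
        cp "$rec" "$dest/" || return 1
        count=$((count + 1))
    done
    echo "saved $count record(s) to $dest"
}
```

Example: run modprobe efi-pstore before the test, reboot after the crash, then call save_pstore /sys/fs/pstore ./crash-logs and attach the copied files.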
Comment 6 Alex Deucher 2014-04-14 20:53:30 UTC
Created attachment 132261 [details]
possible fix v2

Updated patch.
Comment 7 Pali Rohár 2014-04-14 21:25:33 UTC
Created attachment 132281 [details]
pstore log

OK, now the kernel does not crash after loading the radeon module again. I modprobed and rmmoded it several times without any problem. But after I started the X server (with the radeon module loaded), I got another kernel crash. See the output from EFI pstore.
Comment 8 Pali Rohár 2014-04-14 21:44:06 UTC
Created attachment 132291 [details]
dmesg plymouth log

A similar (or the same) problem happens if I start the plymouth splash screen (which uses the intel fb) and then load the radeon module.
Comment 9 Pali Rohár 2014-04-21 12:53:50 UTC
Any idea what to do about the last two NULL pointer dereferences in radeon_driver_open_kms?
Comment 10 Pali Rohár 2014-04-30 14:22:56 UTC
@Alex Deucher: ping
Comment 11 Mateusz Lenik 2018-06-08 10:41:05 UTC
Same issue for amdgpu after unbind (not sure if this should be a separate bug):

rook ~ ➤ ls -l /sys/class/hwmon/hwmon1/device
lrwxrwxrwx 1 root root 0 cze  8 12:32 /sys/class/hwmon/hwmon1/device -> ../../../0000:03:00.0
rook ~ ➤ cat /sys/class/hwmon/hwmon1/fan1_input   
[1]    9145 killed     cat /sys/class/hwmon/hwmon1/fan1_input

Reading fan1_input causes an OOPS:
[  590.507564] BUG: unable to handle kernel NULL pointer dereference at 00000000000000b8
[  590.507584] IP: amdgpu_hwmon_get_fan1_input+0x33/0x84
[  590.507587] PGD 0 P4D 0 
[  590.507593] Oops: 0000 [#4] PREEMPT SMP PTI
[  590.507597] Modules linked in:
[  590.507610] CPU: 39 PID: 9222 Comm: cat Tainted: G      D          4.16.14-gentoo #3
[  590.507613] Hardware name: ASUSTeK COMPUTER INC. Z10PE-D16 WS/Z10PE-D16 WS, BIOS 3407 03/10/2017
[  590.507617] RIP: 0010:amdgpu_hwmon_get_fan1_input+0x33/0x84
[  590.507620] RSP: 0018:ffff97790d2dfd68 EFLAGS: 00010246
[  590.507624] RAX: 0000000000000000 RBX: ffff9147b5cd3000 RCX: ffff9137b5428ac8
[  590.507627] RDX: ffff9137b5aa0000 RSI: ffffffffbbb3e1c0 RDI: ffff9137b55f7008
[  590.507630] RBP: fffffffffffffffb R08: 0000000000000001 R09: 0000000000000000
[  590.507633] R10: ffff9137af38d400 R11: 0000000000000000 R12: ffffffffbb30d000
[  590.507636] R13: ffff9147b590d400 R14: ffff91476a16db00 R15: 0000000000000001
[  590.507639] FS:  00007f3838cbe540(0000) GS:ffff9147bf400000(0000) knlGS:0000000000000000
[  590.507641] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  590.507643] CR2: 00000000000000b8 CR3: 0000002023776005 CR4: 00000000003606e0
[  590.507645] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  590.507647] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  590.507649] Call Trace:
[  590.507660]  dev_attr_show+0x23/0x44
[  590.507668]  sysfs_kf_seq_show+0x7f/0xce
[  590.507676]  seq_read+0x1c1/0x3d1
[  590.507687]  __vfs_read+0x33/0xcc
[  590.507693]  vfs_read+0x9a/0xcf
[  590.507696]  SyS_read+0x5f/0xa3
[  590.507703]  do_syscall_64+0x79/0x88
[  590.507711]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[  590.507715] RIP: 0033:0x7f38387f8b75
[  590.507718] RSP: 002b:00007ffff2570970 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[  590.507721] RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007f38387f8b75
[  590.507723] RDX: 0000000000020000 RSI: 00007f3838cd0000 RDI: 0000000000000003
[  590.507725] RBP: 0000000000020000 R08: 00000000ffffffff R09: 0000000000000000
[  590.507727] R10: 000000000000039b R11: 0000000000000246 R12: 00007f3838cd0000
[  590.507729] R13: 0000000000000003 R14: 00007f3838cd000f R15: 0000000000020000
[  590.507738] Code: d3 48 83 ec 10 48 8b 97 18 01 00 00 65 48 8b 04 25 28 00 00 00 48 89 44 24 08 31 c0 c7 44 24 04 00 00 00 00 48 8b 82 08 49 00 00 <48> 8b 80 b8 00 00 00 48 85 c0 74 15 48 8b ba f8 48 00 00 48 8d 
[  590.507821] RIP: amdgpu_hwmon_get_fan1_input+0x33/0x84 RSP: ffff97790d2dfd68
[  590.507824] CR2: 00000000000000b8
[  590.507830] ---[ end trace eaed7563e433ab4e ]---
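
A note on reading the Code: line in the oops above: the byte in angle brackets marks the start of the instruction at RIP. Here the bytes 48 8b 80 b8 00 00 00 decode to mov 0xb8(%rax),%rax, and with RAX = 0 the load hits address 0xb8, which matches CR2. The helper below is a small sketch (the function name faulting_bytes is made up for illustration) that pulls those bytes out of a saved oops so they can be handed to a disassembler. The kernel tree's scripts/decodecode does a fuller job.

```shell
# Read oops text on stdin and print the code bytes starting at the
# faulting instruction (the byte the kernel wraps in <..>).
faulting_bytes() {
    # First sed: keep only the Code: line, drop everything before the
    # <xx> marker, and remove the angle brackets around that byte.
    # Second sed: trim trailing spaces.
    sed -n 's/.*Code: .*<\([0-9a-f][0-9a-f]\)> */\1 /p' | sed 's/ *$//'
}
```

Example: faulting_bytes < oops.txt prints the bytes from the faulting instruction to the end of the Code: line.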
