Bug 101971

Summary: Regression with 'x86/cacheinfo: Move cacheinfo sysfs code to generic infrastructure' on AMD i686
Product: Platform Specific/Hardware
Reporter: Philip Müller (philm)
Component: i386
Assignee: platform_i386
Status: NEW ---
Severity: high
CC: linux, philm, pomidorabelisima
Priority: P1
Hardware: i386
OS: Linux
URL: https://github.com/manjaro/packages-core/issues/14
Kernel Version: 4.1, 4.1.x, 4.2-rcX
Subsystem:
Regression: Yes
Bisected commit-id:
Attachments: Bisect (0d55ba4)
Kernel panic (part1)
Kernel panic (part2)
Fix NULL pointer dereference in the error/cleanup path
Fix cache_shared_cpu_map_remove() checking for check sib_cpu_ci->info_list
Final upstream patch

Description Philip Müller 2015-07-26 08:15:53 UTC
Created attachment 183661 [details]
Bisect (0d55ba4)

Hi all,

'x86/cacheinfo: Move cacheinfo sysfs code to generic infrastructure' (commit 0d55ba46bfbee64fd2b492b87bfe2ec172e7b056) causes a regression on 32-bit AMD systems.

The facts are:

- An i686 kernel does not boot on AMD hardware with more than one CPU core (x86_64 works, however).
- With 0d55ba4 reverted, the kernel boots.

On failure it produces the following error message:

Failed to access perfctr msr (MSR c0010007 is 0)

task: f58e0000 ti: f58e8000 task.ti: f58e800
EIP: 0060:[<c135a903>] EFLAGS: 00010206 CPU: 0
EIP is at free_cache_attributes+0x83/0xd0
EAX: 00000001 EBX: f589d46c ECX: 00000090 EDX: 360c2000
ESI: 00000000 EDI: c1724a80 EBP: f58e9ec0 ESP: f58e9ea0
 DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
CR0: 8005003b CR2: 000000ac CR3: 01731000 CR4: 000006d0

More detailed info can be found here:
https://github.com/manjaro/packages-core/issues/14

kind regards
Philip Müller
--------------------------
Manjaro Project-Lead
Comment 1 Philip Müller 2015-07-26 08:18:30 UTC
Created attachment 183671 [details]
Kernel panic (part1)
Comment 2 Philip Müller 2015-07-26 08:18:49 UTC
Created attachment 183681 [details]
Kernel panic (part2)
Comment 3 Philip Müller 2015-07-26 10:17:24 UTC
Created attachment 183741 [details]
Fix NULL pointer dereference in the error/cleanup path

That's a trivial NULL pointer dereference in the error/cleanup
path. Patch below should fix it.

Thanks,
tglx

https://lkml.org/lkml/2015/7/26/16
Comment 4 Philip Müller 2015-07-26 10:20:49 UTC
Created attachment 183751 [details]
Fix cache_shared_cpu_map_remove() checking for check sib_cpu_ci->info_list

Well, I have a slightly different, and of course totally untested, possible
solution:

cache_shared_cpu_map_setup() does check sib_cpu_ci->info_list before
setting cpumask bits while cache_shared_cpu_map_remove() doesn't. Balancing
this out would mean (see attachment).

-- 
Regards/Gruss,
Boris.

https://lkml.org/lkml/2015/7/26/20
Comment 5 Philip Müller 2015-07-27 21:09:51 UTC
Created attachment 183881 [details]
Final upstream patch

Philip Müller reported a hang when booting a 32-bit 4.1 kernel on an AMD
box. A fragment of the splat was enough to pinpoint the issue:

  task: f58e0000 ti: f58e8000 task.ti: f58e800
  EIP: 0060:[<c135a903>] EFLAGS: 00010206 CPU: 0
  EIP is at free_cache_attributes+0x83/0xd0
  EAX: 00000001 EBX: f589d46c ECX: 00000090 EDX: 360c2000
  ESI: 00000000 EDI: c1724a80 EBP: f58e9ec0 ESP: f58e9ea0
   DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
  CR0: 8005003b CR2: 000000ac CR3: 01731000 CR4: 000006d0

cache_shared_cpu_map_setup() did check the sibling CPU's cacheinfo descriptor
while the respective teardown path cache_shared_cpu_map_remove() didn't.
Fix that.

From tglx's version: to be on the safe side, move the cacheinfo
descriptor check to free_cache_attributes(), thus cleaning up the
hotplug path a little and making this even more robust.

-- 
Regards/Gruss,
    Boris.
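
For readers who want to see the shape of the problem outside the kernel, here is a minimal userspace sketch of a teardown path that mirrors the setup path's NULL check. It is not the attached patch: the struct layouts, the fixed-size ci_table array, the plain bitmask standing in for a cpumask, and the name free_cache_attributes_sketch() are all simplified assumptions made for illustration. Only info_list, the sibling check, and the idea of guarding free_cache_attributes() come from the discussion above.

```c
#include <stddef.h>
#include <stdlib.h>

#define NR_CPUS 4

/* Simplified stand-ins for the kernel structures (assumed layout). */
struct cacheinfo {
    unsigned long shared_cpu_map;   /* bitmask of CPUs sharing this leaf */
};

struct cpu_cacheinfo {
    struct cacheinfo *info_list;    /* NULL until this CPU has been set up */
    unsigned int num_leaves;
};

static struct cpu_cacheinfo ci_table[NR_CPUS];

/*
 * Tear down the cache info of one CPU.
 * Returns 1 if something was freed, 0 if the CPU was never set up.
 */
static int free_cache_attributes_sketch(unsigned int cpu)
{
    struct cpu_cacheinfo *this_ci = &ci_table[cpu];
    unsigned int sib, leaf;

    /* Guard in the free path itself: nothing allocated, nothing to undo. */
    if (!this_ci->info_list)
        return 0;

    for (sib = 0; sib < NR_CPUS; sib++) {
        struct cpu_cacheinfo *sib_ci = &ci_table[sib];

        /* Same check the setup path performs: skip siblings without a list. */
        if (sib == cpu || !sib_ci->info_list)
            continue;

        /* Remove this CPU from every leaf the sibling knows about. */
        for (leaf = 0; leaf < sib_ci->num_leaves; leaf++)
            sib_ci->info_list[leaf].shared_cpu_map &= ~(1UL << cpu);
    }

    free(this_ci->info_list);
    this_ci->info_list = NULL;
    this_ci->num_leaves = 0;
    return 1;
}
```

Without the two info_list checks, tearing down a CPU whose own or whose sibling's descriptor was never allocated would index through a NULL pointer, which is the kind of access the small CR2 value in the splat suggests.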
Comment 6 poma 2015-08-14 06:24:19 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=1253566

Will the fix land in stable 4.1.6 and mainline 4.2-rc7?
Comment 7 Philip Müller 2015-08-14 17:33:14 UTC
The quick answer is no. The long answer you can find here:
https://lists.manjaro.org/pipermail/manjaro-dev/Week-of-Mon-20150803/000579.html
Comment 8 poma 2015-09-17 11:08:52 UTC
"Final upstream patch"
https://bugzilla.kernel.org/attachment.cgi?id=183881
[PATCH] cpu/cacheinfo: Fix teardown path

Booting OK on:
- lscpu | egrep op-mode\|Vendor
CPU op-mode(s):        32-bit, 64-bit
Vendor ID:             AuthenticAMD
- lscpu | egrep op-mode\|Vendor
CPU op-mode(s):        32-bit
Vendor ID:             AuthenticAMD

Tested on installed system (Fedora release 22)
with patched kernels:
- uname -r
4.1.5-201.fc22.i686
- uname -r
4.2.0-0.rc6.git0.4.fc22.i686