Bug 11337

Summary: Warning in during hotplug on 2.6.27-rc2-git5
Product: Platform Specific/Hardware Reporter: Rafael J. Wysocki (rjw)
Component: x86-64Assignee: Rafael J. Wysocki (rjw)
Status: CLOSED CODE_FIX    
Severity: normal CC: andi-bz, bunk, mark.langsdorf
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.27-rc2-git5 Subsystem:
Regression: --- Bisected commit-id:
Bug Depends on:    
Bug Blocks: 7216    
Attachments: dmesg log from failed online/offline
Patch to try
Debug patch to check the ordering of CPU hotplug notifiers
Attempt to fix the problem

Description Rafael J. Wysocki 2008-08-14 08:53:34 UTC
Subject    : Warning in during hotplug on 2.6.27-rc2-git5
Submitter  : Mark Langsdorf <mark.langsdorf@amd.com>
Date       : 2008-08-12 21:56
References : http://marc.info/?l=linux-kernel&m=121857820413373&w=4

This entry is being used for tracking a regression from 2.6.26.  Please don't
close it until the problem is fixed in the mainline.
Comment 1 Mark Langsdorf 2008-08-18 08:50:26 UTC
Created attachment 17304 [details]
dmesg log from failed online/offline

At Andi K.'s request, I added a debug statement to the mce_cpu_callback and caused the crash.
Comment 2 Mark Langsdorf 2008-08-18 08:51:25 UTC
The system is a 2 socket Tyan motherboard populated with quad-core AMD processors.  It only fails reliably when offlining cpu4 or above.
Comment 3 Rafael J. Wysocki 2008-08-18 09:45:58 UTC
Please check if /sys/devices/system/machinecheck/machinecheck4/ is still present after you have offline CPU4 and if so, please see what is shown by
'ls /sys/devices/system/machinecheck/machinecheck4/'
Comment 4 Rafael J. Wysocki 2008-08-18 09:46:26 UTC
s/offline/offlined/
Comment 5 Mark Langsdorf 2008-08-18 09:54:05 UTC
/sys/devices/system/machinecheck/machinecheck4/ exists and contains the threshold_bank4 directory.

The other cores on the cpu (cpus 5, 6, 7) are missing the threshold_bank4 directory, though, which isn't right.  cpu0 has the master directory for
the first processor and cores 1,2,3 have symbolic links to it.

Now I'm wondering if the bug has always been in the code but is only being exposed by quadcores.
Comment 6 Rafael J. Wysocki 2008-08-18 10:05:06 UTC
(In reply to comment #5)
> /sys/devices/system/machinecheck/machinecheck4/ exists and contains the
> threshold_bank4 directory.

This one is causing the problem to happen, then.

> The other cores on the cpu (cpus 5, 6, 7) are missing the threshold_bank4
> directory, though, which isn't right.  cpu0 has the master directory for
> the first processor and cores 1,2,3 have symbolic links to it.

That's what I'm observing on my single-socket quad-core.

> Now I'm wondering if the bug has always been in the code but is only being
> exposed by quadcores.

I think the problem is that there are two multicore CPUs, rather than they are quad-core.  Namely, IMO, mce_remove_device() has a problem with removing the subdirectory from machinecheck4/, because it's not empty at that time.
Comment 7 Rafael J. Wysocki 2008-08-18 10:14:18 UTC
Created attachment 17309 [details]
Patch to try

I wonder if this patch helps (untested).
Comment 8 Mark Langsdorf 2008-08-18 10:25:54 UTC
It does not.  
Comment 9 Rafael J. Wysocki 2008-08-18 11:10:10 UTC
Then IMO we have to teach mce_remove_device() to remove the threshold_bank4 subdirectory explicitly for the first core of each non-boot CPU.
Comment 10 Rafael J. Wysocki 2008-08-18 11:23:14 UTC
Well, this one is added by AMD MCE (mce_amd_64.c), but it should work as long
as the ordering of threshold_cpu_callback() and mce_cpu_callback() is correct.

I'll prepare a debug patch to check that in a while.
Comment 11 Rafael J. Wysocki 2008-08-18 13:05:02 UTC
Created attachment 17314 [details]
Debug patch to check the ordering of CPU hotplug notifiers

Please try to offline CPU4 with this patch applied and see if the MCE printk goes after the MCE AMD64 one.
Comment 12 Mark Langsdorf 2008-08-19 08:02:15 UTC
The sysfs attributes are removed by mce_64.c before threshold bank4 is removed by mce_amd_64.c.
Comment 13 Andi Kleen 2008-08-19 08:10:15 UTC
Ugly. Just setting the priority is not enough. To fix it we would need two notifiers because the needed order is different for hotadd versus hot-remove.
Perhaps it would be best to just call the amd functions from the main mce
notifier.
Comment 14 Andi Kleen 2008-08-19 08:27:21 UTC
BTW an alternative would be to just remove the mce_amd_64 sysfs code. Instead just
always set a threshold of 1 -- i don't think anything else is particularly
useful.
Comment 15 Mark Langsdorf 2008-08-19 09:29:37 UTC
A threshold setting of 1 defeats the purpose, at least in the case of ECC DRAM.  Dense arrays of DRAM can be expected to generate errors just from random cosmic radiation.  The point of setting a threshold is to make those standard errors and only pick up the truly significant ones caused by the DRAM failing.
Comment 16 Rafael J. Wysocki 2008-08-19 10:14:43 UTC
(In reply to comment #13)
> Ugly. Just setting the priority is not enough. To fix it we would need two
> notifiers because the needed order is different for hotadd versus hot-remove.
> Perhaps it would be best to just call the amd functions from the main mce
> notifier.

That would be the best solution IMO.

Still, that's not the only problem here, because the wrong ordering alone would have left just the empty machinecheck4/ subdirectory.  It looks like threshold_remove_bank() should do more than just kobject_put() to remove the threshold_bank4 directory.
Comment 17 Rafael J. Wysocki 2008-08-20 15:15:28 UTC
Created attachment 17346 [details]
Attempt to fix the problem

This patch is supposed to fix the problem (if I did it right), but I can't really test it, because I don't have the hardware the problem is reproducible on.
Comment 18 Andi Kleen 2008-08-20 20:13:56 UTC
Patch looks good.

Acked-by: Andi Kleen 
Comment 19 Rafael J. Wysocki 2008-08-21 06:38:38 UTC
Mark, could you test it, please?
Comment 20 Mark Langsdorf 2008-08-21 08:57:39 UTC
That fixed the problem for me.  Thanks!

Tested-by: Mark Langsdorf <mark.langsdorf@amd.com>
Comment 21 Rafael J. Wysocki 2008-08-21 10:41:12 UTC
Thanks!

Could you please also do one additional test for me?

Namely, the patch adds 'kobject_del(b->kobj);' in 'hreshold_remove_bank()'.  Can you please remove that 'kobject_del(b->kobj);' (in theory it shouldn't be necessary) and see if the problem is still fixed?
Comment 22 Rafael J. Wysocki 2008-08-21 10:42:05 UTC
s/hreshold_remove_bank/threshold_remove_bank/ , sorry
Comment 23 Mark Langsdorf 2008-08-22 08:26:03 UTC
The theory is incorrect - the kobject_del is required.  I get a crash without it.
Comment 24 Rafael J. Wysocki 2008-08-22 09:09:22 UTC
Thanks for testing.

I'm going to push the patch from Comment #20 upstream.
Comment 25 Adrian Bunk 2008-08-25 13:51:13 UTC
fixed by commit 8735728ef8dc935c4fb351f913758fdbb62c308d