Subject : Warning in during hotplug on 2.6.27-rc2-git5 Submitter : Mark Langsdorf <mark.langsdorf@amd.com> Date : 2008-08-12 21:56 References : http://marc.info/?l=linux-kernel&m=121857820413373&w=4 This entry is being used for tracking a regression from 2.6.26. Please don't close it until the problem is fixed in the mainline.
Created attachment 17304 [details] dmesg log from failed online/offline At Andi K.'s request, I added a debug statement to the mce_cpu_callback and caused the crash.
The system is a 2 socket Tyan motherboard populated with quad-core AMD processors. It only fails reliably when offlining cpu4 or above.
Please check if /sys/devices/system/machinecheck/machinecheck4/ is still present after you have offline CPU4 and if so, please see what is shown by 'ls /sys/devices/system/machinecheck/machinecheck4/'
s/offline/offlined/
/sys/devices/system/machinecheck/machinecheck4/ exists and contains the threshold_bank4 directory. The other cores on the cpu (cpus 5, 6, 7) are missing the threshold_bank4 directory, though, which isn't right. cpu0 has the master directory for the first processor and cores 1,2,3 have symbolic links to it. Now I'm wondering if the bug has always been in the code but is only being exposed by quadcores.
(In reply to comment #5) > /sys/devices/system/machinecheck/machinecheck4/ exists and contains the > threshold_bank4 directory. This one is causing the problem to happen, then. > The other cores on the cpu (cpus 5, 6, 7) are missing the threshold_bank4 > directory, though, which isn't right. cpu0 has the master directory for > the first processor and cores 1,2,3 have symbolic links to it. That's what I'm observing on my single-socket quad-core. > Now I'm wondering if the bug has always been in the code but is only being > exposed by quadcores. I think the problem is that there are two multicore CPUs, rather than they are quad-core. Namely, IMO, mce_remove_device() has a problem with removing the subdirectory from machinecheck4/, because it's not empty at that time.
Created attachment 17309 [details] Patch to try I wonder if this patch helps (untested).
It does not.
Then IMO we have to teach mce_remove_device() to remove the threshold_bank4 subdirectory explicitly for the first core of each non-boot CPU.
Well, this one is added by AMD MCE (mce_amd_64.c), but it should work as long as the ordering of threshold_cpu_callback() and mce_cpu_callback() is correct. I'll prepare a debug patch to check that in a while.
Created attachment 17314 [details] Debug patch to check the ordering of CPU hotplug notifiers Please try to offline CPU4 with this patch applied and see if the MCE printk goes after the MCE AMD64 one.
The sysfs attributes are removed by mce_64.c before threshold bank4 is removed by mce_amd_64.c.
Ugly. Just setting the priority is not enough. To fix it we would need two notifiers because the needed order is different for hotadd versus hot-remove. Perhaps it would be best to just call the amd functions from the main mce notifier.
BTW an alternative would be to just remove the mce_amd_64 sysfs code. Instead just always set a threshold of 1 -- i don't think anything else is particularly useful.
A threshold setting of 1 defeats the purpose, at least in the case of ECC DRAM. Dense arrays of DRAM can be expected to generate errors just from random cosmic radiation. The point of setting a threshold is to make those standard errors and only pick up the truly significant ones caused by the DRAM failing.
(In reply to comment #13) > Ugly. Just setting the priority is not enough. To fix it we would need two > notifiers because the needed order is different for hotadd versus hot-remove. > Perhaps it would be best to just call the amd functions from the main mce > notifier. That would be the best solution IMO. Still, that's not the only problem here, because the wrong ordering alone would have left just the empty machinecheck4/ subdirectory. It looks like threshold_remove_bank() should do more than just kobject_put() to remove the threshold_bank4 directory.
Created attachment 17346 [details] Attempt to fix the problem This patch is supposed to fix the problem (if I did it right), but I can't really test it, because I don't have the hardware the problem is reproducible on.
Patch looks good. Acked-by: Andi Kleen
Mark, could you test it, please?
That fixed the problem for me. Thanks! Tested-by: Mark Langsdorf <mark.langsdorf@amd.com>
Thanks! Could you please also do one additional test for me? Namely, the patch adds 'kobject_del(b->kobj);' in 'hreshold_remove_bank()'. Can you please remove that 'kobject_del(b->kobj);' (in theory it shouldn't be necessary) and see if the problem is still fixed?
s/hreshold_remove_bank/threshold_remove_bank/ , sorry
The theory is incorrect - the kobject_del is required. I get a crash without it.
Thanks for testing. I'm going to push the patch from Comment #20 upstream.
fixed by commit 8735728ef8dc935c4fb351f913758fdbb62c308d