Bug 214469 - mce_setup BUG: using smp_processor_id() in preemptible [00000000] code: swapper/0/1
Status: NEW
Alias: None
Product: Process Management
Classification: Unclassified
Component: Preemption
Hardware: x86-64 Linux
Importance: P1 normal
Assignee: acpi_other
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-09-20 12:30 UTC by Nicholas Fries
Modified: 2021-10-30 06:24 UTC

See Also:
Kernel Version: 5.14.2
Subsystem:
Regression: No
Bisected commit-id:



Description Nicholas Fries 2021-09-20 12:30:47 UTC
After an unexpected (firmware-initiated) reboot, the system prints the BERT error record from the previous boot. That part works as intended, except that "MSR Address" should be followed by a "Register Array" entry containing the information needed to decode the error. Instead, the kernel prints "BUG: using smp_processor_id() in preemptible [00000000] code: swapper/0/1" followed by a stack trace.

I've filed this under ACPI because mce_setup is called by apei_smca_report_x86_error, and I believe APEI is part of ACPI, but please move or reclassify it as needed.

The system is an AMD Ryzen Threadripper PRO 3975WX (32 cores) with 128 GB of ECC memory, running "Linux 5.14.2-gentoo-x86_64 #2 SMP PREEMPT".

Please let me know if you need more information and I will provide it.

--

[  +0.000234] BERT: Error records from previous boot:
[  +0.000001] [Hardware Error]: event severity: fatal
[  +0.000001] [Hardware Error]:  Error 0, type: fatal
[  +0.000000] [Hardware Error]:  fru_text: ProcessorError
[  +0.000001] [Hardware Error]:   section_type: IA32/X64 processor error
[  +0.000001] [Hardware Error]:   Local APIC_ID: 0x3f
[  +0.000001] [Hardware Error]:   CPUID Info:
[  +0.000001] [Hardware Error]:   00000000: 00830f10 00000000 3f400800 00000000
[  +0.000001] [Hardware Error]:   00000010: 76d8320b 00000000 178bfbff 00000000
[  +0.000001] [Hardware Error]:   00000020: 00000000 00000000 00000000 00000000
[  +0.000001] [Hardware Error]:   Error Information Structure 0:
[  +0.000000] [Hardware Error]:    Error Structure Type: cache error
[  +0.000001] [Hardware Error]:    Check Information: 0x00000000064d001f
[  +0.000001] [Hardware Error]:     Transaction Type: 1, Data Access
[  +0.000000] [Hardware Error]:     Operation: 3, data read
[  +0.000001] [Hardware Error]:     Level: 1
[  +0.000000] [Hardware Error]:     Processor Context Corrupt: true
[  +0.000001] [Hardware Error]:     Uncorrected: true
[  +0.000000] [Hardware Error]:   Context Information Structure 0:
[  +0.000001] [Hardware Error]:    Register Context Type: MSR Registers (Machine Check and other MSRs)
[  +0.000000] [Hardware Error]:    Register Array Size: 0x0050
[  +0.000001] [Hardware Error]:    MSR Address: 0xc0002001
[  +0.000001] BUG: using smp_processor_id() in preemptible [00000000] code: swapper/0/1
[  +0.000000] caller is mce_setup+0x32/0x100
[  +0.000004] CPU: 32 PID: 1 Comm: swapper/0 Not tainted 5.14.2-gentoo-x86_64 #2 ce26423c1c94c39ff48a0733dda399ae462242aa
[  +0.000002] Hardware name: ASUS System Product Name/Pro WS WRX80E-SAGE SE WIFI, BIOS 0602 07/13/2021
[  +0.000001] Call Trace:
[  +0.000002]  dump_stack_lvl+0x34/0x44
[  +0.000003]  check_preemption_disabled+0xd8/0xe0
[  +0.000002]  mce_setup+0x32/0x100
[  +0.000002]  apei_smca_report_x86_error+0x68/0x140
[  +0.000003]  cper_print_proc_ia.cold+0x3e4/0x5f8
[  +0.000002]  ? vprintk_emit+0xf7/0x1a0
[  +0.000002]  ? em_dev_register_perf_domain.cold+0x11/0xa3
[  +0.000003]  cper_estatus_print_section+0x813/0x9ea
[  +0.000002]  ? snprintf+0x49/0x60
[  +0.000003]  cper_estatus_print+0xad/0xe8
[  +0.000001]  bert_init+0x1b2/0x214
[  +0.000004]  ? setup_bert_disable+0x12/0x12
[  +0.000001]  do_one_initcall+0x41/0x1f0
[  +0.000002]  kernel_init_freeable+0x1fe/0x265
[  +0.000003]  ? rest_init+0xd0/0xd0
[  +0.000001]  kernel_init+0x16/0x110
[  +0.000001]  ret_from_fork+0x1f/0x30
[  +0.000007] [Hardware Error]:  Error 1, type: recoverable
[  +0.000001] [Hardware Error]:  fru_text: PcieError
[  +0.000001] [Hardware Error]:   section_type: PCIe error
[  +0.000000] [Hardware Error]:   port_type: 4, root port
[  +0.000001] [Hardware Error]:   version: 0.2
[  +0.000000] [Hardware Error]:   command: 0x0003, status: 0x0010
[  +0.000001] [Hardware Error]:   device_id: 0000:00:03.1
[  +0.000001] [Hardware Error]:   slot: 0
[  +0.000000] [Hardware Error]:   secondary_bus: 0x01
[  +0.000001] [Hardware Error]:   vendor_id: 0x1022, device_id: 0x1483
[  +0.000000] [Hardware Error]:   class_code: 060400
[  +0.000001] [Hardware Error]:   bridge: secondary_status: 0x2000, control: 0x0010
[  +0.000018] mce: [Hardware Error]: Machine check events logged
[  +0.000001] mce: [Hardware Error]: CPU 63: Machine Check: 0 Bank 0: be802800000c0135
[  +0.000018] mce: [Hardware Error]: TSC 0 ADDR 100fffdfc001080 MISC d01c0ff500000000 IPID b000000000 
[  +0.000018] mce: [Hardware Error]: PROCESSOR 2:830f10 TIME 1632132503 SOCKET 0 APIC 3f microcode 830104d
Comment 1 Nicholas Fries 2021-09-21 07:24:15 UTC
I thought this was a function of ACPI because BERT and APEI are involved.

The underlying issue seems to be in arch/x86/kernel/cpu/mce/core.c, line 136, where I see the following:

/* Do initial initialization of a struct mce */
noinstr void mce_setup(struct mce *m)
{
	memset(m, 0, sizeof(struct mce));
	m->cpu = m->extcpu = smp_processor_id();


...

I'm still confirming, but this does not appear to be an ACPI problem; ACPI is only where the hint that an error occurred comes from. Please correct me if I'm wrong.
Comment 2 Nicholas Fries 2021-09-21 15:59:17 UTC
I've changed my local kernel to use get_cpu() and put_cpu() instead of smp_processor_id(). I'm waiting for the next crash to see whether the MCE Register Array is printed.
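
For illustration only, here is a minimal userspace sketch of the kind of change described above. It is not the actual local patch and not kernel code: every sketch_* name is a hypothetical stand-in, the preemption counter is a plain integer, and the CPU number 32 is simply taken from the trace. The idea it models is that the debug check fires when the CPU id is read while preemption is enabled, and that a get_cpu()/put_cpu() pair disables preemption around the read.

```c
/* Userspace model only. All sketch_* names are hypothetical stand-ins,
 * not the kernel's real implementation. */
#include <assert.h>
#include <stdio.h>
#include <string.h>

int sketch_preempt_count;     /* > 0 models "preemption disabled" */
int sketch_current_cpu = 32;  /* assumed CPU number, as in the trace */
int sketch_bug_reports;       /* number of times the debug check fired */

/* Models the debug check: reading the CPU id while preemptible is flagged. */
int sketch_smp_processor_id(void)
{
	if (sketch_preempt_count == 0) {
		sketch_bug_reports++;
		fprintf(stderr, "BUG: using smp_processor_id() in preemptible code\n");
	}
	return sketch_current_cpu;
}

/* Models get_cpu(): disable preemption first, then read the CPU id. */
int sketch_get_cpu(void)
{
	sketch_preempt_count++;
	return sketch_smp_processor_id();
}

/* Models put_cpu(): re-enable preemption; must pair with sketch_get_cpu(). */
void sketch_put_cpu(void)
{
	assert(sketch_preempt_count > 0);
	sketch_preempt_count--;
}

struct sketch_mce {
	int cpu;
	int extcpu;
};

/* Original pattern: trips the check when called from preemptible context. */
void sketch_mce_setup_old(struct sketch_mce *m)
{
	memset(m, 0, sizeof(*m));
	m->cpu = m->extcpu = sketch_smp_processor_id();
}

/* Changed pattern: the CPU id is read with preemption disabled. */
void sketch_mce_setup_new(struct sketch_mce *m)
{
	memset(m, 0, sizeof(*m));
	m->cpu = m->extcpu = sketch_get_cpu();
	sketch_put_cpu();
}
```

Calling sketch_mce_setup_old() with the counter at zero increments sketch_bug_reports, while sketch_mce_setup_new() does not and leaves the counter balanced. In a real kernel the value read this way is only a snapshot, since the task may migrate to another CPU once put_cpu() has run.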
Comment 3 Nicholas Fries 2021-10-30 06:24:02 UTC
Just wanted to update this - my local change resolved the issue.
