Bug 219525

Summary: asus-ec-sensors: Concurrent access to the ACPI EC detected
Product: Drivers Reporter: Nick Owens (mischief)
Component: Hardware MonitoringAssignee: Denis Pauk (pauk.denis)
Status: RESOLVED UNREPRODUCIBLE    
Severity: low CC: eugene.shalygin, jdelvare, pauk.denis
Priority: P3    
Hardware: All   
OS: Linux   
Kernel Version: Subsystem:
Regression: No Bisected commit-id:
Attachments: lsmod
sensors
kconfig
acpidump -b -n DSDT -z
acpidump -b -n DSDT -z, bios 2506

Description Nick Owens 2024-11-24 08:28:26 UTC
beast ~ # udevadm info /sys/class/hwmon/hwmon6
P: /devices/platform/asus-ec-sensors/hwmon/hwmon6
M: hwmon6
R: 6
U: hwmon
E: DEVPATH=/devices/platform/asus-ec-sensors/hwmon/hwmon6
E: SUBSYSTEM=hwmon

beast ~ # dmesg|tail
[ 1964.164303] hwmon hwmon6: Concurrent access to the ACPI EC detected.
               Race condition possible.
[ 1969.209293] hwmon hwmon6: Concurrent access to the ACPI EC detected.
               Race condition possible.
[ 1974.165299] hwmon hwmon6: Concurrent access to the ACPI EC detected.
               Race condition possible.
[ 1979.167294] hwmon hwmon6: Concurrent access to the ACPI EC detected.
               Race condition possible.
[ 1984.163645] hwmon hwmon6: Concurrent access to the ACPI EC detected.
               Race condition possible.

beast ~ # inxi -MS
System:
  Host: beast Kernel: 6.11.9-gentoo arch: x86_64 bits: 64
  Desktop: spectrwm v: 3.6.0 Distro: Gentoo Base System release 2.17
Machine:
  Type: Desktop System: ASUS product: N/A v: N/A serial: N/A
  Mobo: ASUSTeK model: ProArt X670E-CREATOR WIFI v: Rev 1.xx
    serial: 221213904400284 UEFI: American Megatrends v: 2007 date: 04/12/2024

happy to provide more information if needed.
Comment 1 Artem S. Tashkinov 2024-11-24 11:53:39 UTC
Denis,

Could you take a look please?
Comment 2 Jean Delvare 2024-11-24 13:11:48 UTC
Which other hardware monitoring drivers are being used on the system? The output of "sensors" should tell.
Comment 3 Denis Pauk 2024-11-24 17:34:27 UTC
I suppose it should be nct6775 + ec-sensor. 

At least in https://github.com/zeule/asus-ec-sensors/issues/50 has been discussed adding such board to list of supported boards.

Also it looks like same issue was discussed for other board: https://github.com/zeule/asus-ec-sensors/issues/43

Also have you added some patch or module/kernel parameters? Could you please check?

Thank you!
Comment 4 Nick Owens 2024-11-24 23:37:42 UTC
Created attachment 307270 [details]
lsmod
Comment 5 Nick Owens 2024-11-24 23:37:59 UTC
Created attachment 307271 [details]
sensors
Comment 6 Nick Owens 2024-11-24 23:38:19 UTC
Created attachment 307272 [details]
kconfig
Comment 7 Nick Owens 2024-11-24 23:54:13 UTC
denis,

nct6775 is explicitly loaded on my system - i believe i did this after our prior discussion in https://bugzilla.kernel.org/show_bug.cgi?id=204807#c302.

on my current boot i don't see 'Concurrent access to the ACPI EC detected' any more, but i don't think i changed anything, so perhaps a race indeed.

with asus_ec_sensors should nct6775 no longer be used?

thanks,
nick
Comment 8 Jean Delvare 2024-11-25 10:14:40 UTC
Generally speaking, if two drivers access the same hardware without proper locking, it's going to be racy, and this is what is being reported here. This is especially bad for devices which are accessed using a single address, data I/O port pair, and even more so when the device uses a banked register map (the same register address points to different hardware registers depending on which bank was previously selected).

So to be on the safe side, it's better to use either driver but not both at the same time. Using both drivers may not bring much anyway, if they report the same sensor values.

That being said, I see that the nct6775 driver already includes some code to deal with concurrent access to the monitoring device on certain Asus boards. On these boards, the nct6775 driver uses ACPI methods to access the hardware instead of direct I/O, and this is supposed to be safe. And your board (ProArt X670E-CREATOR WIFI) is in the list, so it's supposed to be already covered.

Can you try enabling debugging:

# echo 'file nct6775.c +p' > /sys/kernel/debug//dynamic_debug/control

Then unload and reload the nct6775 driver, and check whether the following statement is being added to the kernel log:

"Using Asus WMBD method of ..."

I see that your board was added to the list by Denis so probably he knows better than me and I'll let him drive from here.
Comment 9 Denis Pauk 2024-11-25 11:40:38 UTC
Both of modules should be loaded as provided access to different sensors. 

nct6775 uses acpi methods for access to io. So it's mostly responsibility of firmware to correctly use locks, or at least we expected so.

Ec-sensors module uses non wmi access to ec region, as wmi access is too slow and by testing in some cases took up to several seconds. As result module has list of mutexes for each of board that should be acquired before make any of operations, mutex name are checked manually from acpi tables before add support for a such board.

By https://github.com/zeule/asus-ec-sensors/issues/43, it looks like mutex could be selected incorrectly or some other tools has tried to receive access to ec without acquiring lock used in module.

I suppose need to load both of modules with enabled debug and check which operation are conflicting.
Comment 10 Denis Pauk 2024-11-25 11:52:43 UTC
I suppose Eugene knows little bit more. 

Could you please correct me if I have missed something?
Comment 11 Eugene Shalygin 2024-11-25 12:05:52 UTC
To what Denis wrote already in #9, I can add that the for the AMD 600 series ASUS changed the WMI code to read sensor reading from the EC. I did not bother to decipher it (yet?), because the direct reading from EC registers works for all models so far. I might remember we observed for one of the board models ASUS changed DSDT code with BIOS update so that the specific EC/HW monitor mutex was replaces with the global ACPI one.

Further investigation requires reading the DSDT code. @Nick, could you, please, share that (please the raw dump, not decompiled one)?
Comment 12 Nick Owens 2024-11-25 20:09:13 UTC
Created attachment 307277 [details]
acpidump -b -n DSDT -z

attached DSDT.
Comment 13 Nick Owens 2024-11-25 23:38:29 UTC
i struggled to get dynamic debug to work with reloading the driver - i think that using the debugfs interface doesn't work if you unload the driver (the entries for the module disappear). i had success with `modprobe nct6775 dyndbg=+p`:

[72417.468361] nct6775: Using Asus WMBD method of AsusMbSwInterface to access 0xc1 chip.
[72417.468431] nct6775: Found NCT6796D-S/NCT6799D-R or compatible chip at 0x2e:0x290

for asus_ec_sensors, only:

[72513.918621] asus-ec-sensors asus-ec-sensors: board has 5 EC sensors that span 5 registers

nothing is printed in dmesg when running `sensors`.
Comment 14 Nick Owens 2024-11-26 00:21:49 UTC
Created attachment 307278 [details]
acpidump -b -n DSDT -z, bios 2506

i decided to update my bios to v2506, new DSDT attached. still no more messages about race.
Comment 15 Artem S. Tashkinov 2024-11-26 13:06:54 UTC
(In reply to Nick Owens from comment #14)
> still no more messages about race.

The problem has resolved itself?
Comment 16 Nick Owens 2024-11-26 13:55:06 UTC
(In reply to Artem S. Tashkinov from comment #15)
> (In reply to Nick Owens from comment #14)
> > still no more messages about race.
> 
> The problem has resolved itself?

unfortunately, yes. in my logs i have 11 recorded boots across 3 kernel versions (6.10.6-gentoo, 6.11.5-gentoo-r1, 6.11.5-gentoo-r1) and these messages only ever appear in one boot starting 2024-11-23, lasting for 1 day, and not appearing since.

since i've not seen it at all since, and i updated my bios, and i will hopefully upgrade to 6.12 soon, i think it is safe to just close this unless it starts appearing again.
Comment 17 Nick Owens 2024-11-29 09:46:33 UTC
not seen since i originally reported it.
Comment 18 Jean Delvare 2024-12-02 10:50:01 UTC
(In reply to Denis Pauk from comment #9)
> Both of modules should be loaded as provided access to different sensors. 

I doubt it. If you look at the values, there's a correlation between the temperatures reported by asusec and the temperatures reported by nct6799:

CPU (asusec) matches PECI/TSI Agent 0 Calibration (nct6799)
CPU Package (asusec) matches TSI0_TEMP (nct6799) or Tctl (k10temp)
Motherboard (asusec) matches SYSTIN (nct6799)
T_Sensor (asusec) matches AUXTIN5 (nct6799)
Only VRM (asusec) has no nct6799 nor k10temp match.

This is also expected as asusec is only an extra software interface on top of the hardware (much like Asus' ATK0110 or Abit's µGuru 2 decades ago). Temperature values do need physical sensors on the board or CPU, and these are the values reported by nct6799 (and k10temp).

The fact that the asus-ec-sensors driver once reported a concurrent access is actually a strong hint that two drivers attempted to access the same physical registers (but I did not look at the code so I can't tell more).
Comment 19 Eugene Shalygin 2024-12-02 11:06:57 UTC
(In reply to Jean Delvare from comment #18)
> (In reply to Denis Pauk from comment #9)
> > Both of modules should be loaded as provided access to different sensors. 
> 
> I doubt it. 

I wrote the asusec driver to get access to motherboard sensors readings unavailable via the SIO chip. For example, T_Sensor does not always present in the SIO chip, water temperatures are only available via EC registers. "Extreme" ASUS boards have even more temperature sensors.

> The fact that the asus-ec-sensors driver once reported a concurrent access
> is actually a strong hint that two drivers attempted to access the same 
> physical registers (but I did not look at the code so I can't tell more).

The board firmware itself accesses EC registers. For example, one can make the chipset fan on X570 spin at full speed by looping over EC registers. That's the main reason why I put that warning message in the asusec code.