Bug 219525
Summary: | asus-ec-sensors: Concurrent access to the ACPI EC detected | ||
---|---|---|---|
Product: | Drivers | Reporter: | Nick Owens (mischief) |
Component: | Hardware Monitoring | Assignee: | Denis Pauk (pauk.denis) |
Status: | RESOLVED UNREPRODUCIBLE | ||
Severity: | low | CC: | eugene.shalygin, jdelvare, pauk.denis |
Priority: | P3 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | Subsystem: | ||
Regression: | No | Bisected commit-id: | |
Attachments: |
lsmod
sensors kconfig acpidump -b -n DSDT -z acpidump -b -n DSDT -z, bios 2506 |
Description
Nick Owens
2024-11-24 08:28:26 UTC
Denis, Could you take a look please? Which other hardware monitoring drivers are being used on the system? The output of "sensors" should tell. I suppose it should be nct6775 + ec-sensor. At least in https://github.com/zeule/asus-ec-sensors/issues/50 has been discussed adding such board to list of supported boards. Also it looks like same issue was discussed for other board: https://github.com/zeule/asus-ec-sensors/issues/43 Also have you added some patch or module/kernel parameters? Could you please check? Thank you! Created attachment 307270 [details]
lsmod
Created attachment 307271 [details]
sensors
Created attachment 307272 [details]
kconfig
denis, nct6775 is explicitly loaded on my system - i believe i did this after our prior discussion in https://bugzilla.kernel.org/show_bug.cgi?id=204807#c302. on my current boot i don't see 'Concurrent access to the ACPI EC detected' any more, but i don't think i changed anything, so perhaps a race indeed. with asus_ec_sensors should nct6775 no longer be used? thanks, nick Generally speaking, if two drivers access the same hardware without proper locking, it's going to be racy, and this is what is being reported here. This is especially bad for devices which are accessed using a single address, data I/O port pair, and even more so when the device uses a banked register map (the same register address points to different hardware registers depending on which bank was previously selected). So to be on the safe side, it's better to use either driver but not both at the same time. Using both drivers may not bring much anyway, if they report the same sensor values. That being said, I see that the nct6775 driver already includes some code to deal with concurrent access to the monitoring device on certain Asus boards. On these boards, the nct6775 driver uses ACPI methods to access the hardware instead of direct I/O, and this is supposed to be safe. And your board (ProArt X670E-CREATOR WIFI) is in the list, so it's supposed to be already covered. Can you try enabling debugging: # echo 'file nct6775.c +p' > /sys/kernel/debug//dynamic_debug/control Then unload and reload the nct6775 driver, and check whether the following statement is being added to the kernel log: "Using Asus WMBD method of ..." I see that your board was added to the list by Denis so probably he knows better than me and I'll let him drive from here. Both of modules should be loaded as provided access to different sensors. nct6775 uses acpi methods for access to io. So it's mostly responsibility of firmware to correctly use locks, or at least we expected so. Ec-sensors module uses non wmi access to ec region, as wmi access is too slow and by testing in some cases took up to several seconds. As result module has list of mutexes for each of board that should be acquired before make any of operations, mutex name are checked manually from acpi tables before add support for a such board. By https://github.com/zeule/asus-ec-sensors/issues/43, it looks like mutex could be selected incorrectly or some other tools has tried to receive access to ec without acquiring lock used in module. I suppose need to load both of modules with enabled debug and check which operation are conflicting. I suppose Eugene knows little bit more. Could you please correct me if I have missed something? To what Denis wrote already in #9, I can add that the for the AMD 600 series ASUS changed the WMI code to read sensor reading from the EC. I did not bother to decipher it (yet?), because the direct reading from EC registers works for all models so far. I might remember we observed for one of the board models ASUS changed DSDT code with BIOS update so that the specific EC/HW monitor mutex was replaces with the global ACPI one. Further investigation requires reading the DSDT code. @Nick, could you, please, share that (please the raw dump, not decompiled one)? Created attachment 307277 [details]
acpidump -b -n DSDT -z
attached DSDT.
i struggled to get dynamic debug to work with reloading the driver - i think that using the debugfs interface doesn't work if you unload the driver (the entries for the module disappear). i had success with `modprobe nct6775 dyndbg=+p`: [72417.468361] nct6775: Using Asus WMBD method of AsusMbSwInterface to access 0xc1 chip. [72417.468431] nct6775: Found NCT6796D-S/NCT6799D-R or compatible chip at 0x2e:0x290 for asus_ec_sensors, only: [72513.918621] asus-ec-sensors asus-ec-sensors: board has 5 EC sensors that span 5 registers nothing is printed in dmesg when running `sensors`. Created attachment 307278 [details]
acpidump -b -n DSDT -z, bios 2506
i decided to update my bios to v2506, new DSDT attached. still no more messages about race.
(In reply to Nick Owens from comment #14) > still no more messages about race. The problem has resolved itself? (In reply to Artem S. Tashkinov from comment #15) > (In reply to Nick Owens from comment #14) > > still no more messages about race. > > The problem has resolved itself? unfortunately, yes. in my logs i have 11 recorded boots across 3 kernel versions (6.10.6-gentoo, 6.11.5-gentoo-r1, 6.11.5-gentoo-r1) and these messages only ever appear in one boot starting 2024-11-23, lasting for 1 day, and not appearing since. since i've not seen it at all since, and i updated my bios, and i will hopefully upgrade to 6.12 soon, i think it is safe to just close this unless it starts appearing again. not seen since i originally reported it. (In reply to Denis Pauk from comment #9) > Both of modules should be loaded as provided access to different sensors. I doubt it. If you look at the values, there's a correlation between the temperatures reported by asusec and the temperatures reported by nct6799: CPU (asusec) matches PECI/TSI Agent 0 Calibration (nct6799) CPU Package (asusec) matches TSI0_TEMP (nct6799) or Tctl (k10temp) Motherboard (asusec) matches SYSTIN (nct6799) T_Sensor (asusec) matches AUXTIN5 (nct6799) Only VRM (asusec) has no nct6799 nor k10temp match. This is also expected as asusec is only an extra software interface on top of the hardware (much like Asus' ATK0110 or Abit's µGuru 2 decades ago). Temperature values do need physical sensors on the board or CPU, and these are the values reported by nct6799 (and k10temp). The fact that the asus-ec-sensors driver once reported a concurrent access is actually a strong hint that two drivers attempted to access the same physical registers (but I did not look at the code so I can't tell more). (In reply to Jean Delvare from comment #18) > (In reply to Denis Pauk from comment #9) > > Both of modules should be loaded as provided access to different sensors. > > I doubt it. I wrote the asusec driver to get access to motherboard sensors readings unavailable via the SIO chip. For example, T_Sensor does not always present in the SIO chip, water temperatures are only available via EC registers. "Extreme" ASUS boards have even more temperature sensors. > The fact that the asus-ec-sensors driver once reported a concurrent access > is actually a strong hint that two drivers attempted to access the same > physical registers (but I did not look at the code so I can't tell more). The board firmware itself accesses EC registers. For example, one can make the chipset fan on X570 spin at full speed by looping over EC registers. That's the main reason why I put that warning message in the asusec code. |