Bug 177311
Summary: | crazy interrupt rate on i801_smbus | ||
---|---|---|---|
Product: | Drivers | Reporter: | Conrad Kostecki (ck+kernelbugzilla) |
Component: | Hardware Monitoring | Assignee: | Jean Delvare (jdelvare) |
Status: | CLOSED CODE_FIX | ||
Severity: | normal | CC: | andy.shevchenko, jarkko.nikula, linux, mika.westerberg, stephane.poignant |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 4.8.1 | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: | cat /proc/interrupts; dmesg output; Debug patch for the i2c-i801 interrupts; Experimental patch disabling SMB_ALERT signal; 2nd version of patch disabling SMB_ALERT signal | ||
Description
Conrad Kostecki
2016-10-12 20:59:43 UTC
The jc42 module seems to work, as lm_sensors does find the sensors after loading it:

Galactica ~ # sensors
jc42-i2c-1-19
Adapter: SMBus I801 adapter at e000
temp1: +30.8°C (low = +0.0°C) ALARM (HIGH, CRIT)
       (high = +0.0°C, hyst = +0.0°C)
       (crit = +0.0°C, hyst = +0.0°C)

jc42-i2c-1-1a
Adapter: SMBus I801 adapter at e000
temp1: +29.5°C (low = +0.0°C) ALARM (HIGH, CRIT)
       (high = +0.0°C, hyst = +0.0°C)
       (crit = +0.0°C, hyst = +0.0°C)

jc42-i2c-1-18
Adapter: SMBus I801 adapter at e000
temp1: +27.2°C (low = +0.0°C) ALARM (HIGH, CRIT)
       (high = +0.0°C, hyst = +0.0°C)
       (crit = +0.0°C, hyst = +0.0°C)

jc42-i2c-1-1b
Adapter: SMBus I801 adapter at e000
temp1: +28.2°C (low = +0.0°C) ALARM (HIGH, CRIT)
       (high = +0.0°C, hyst = +0.0°C)
       (crit = +0.0°C, hyst = +0.0°C)

You need to set the temperature limits correctly. Without limits, the chips will persistently generate alarms, which is the likely cause of the interrupts.

That won't solve the completion interrupt timeouts, though. That may be another problem.

(In reply to Guenter Roeck from comment #2)
> You need to set the temperature limits correctly. Without limits, the chips
> will persistently generate alarms, which is the likely cause of the
> interrupts.
>
> That won't solve the completion interrupt timeouts, though. That may be
> another problem.

Hi! Thanks for your answer. I gave it a try and set those limits, so sensors no longer shows any ALARM. That does not seem to be the cause, because after setting them the interrupts are still generated massively.
jc42-i2c-1-1b
Adapter: SMBus I801 adapter at e000
RAM: +30.0°C (low = +0.0°C)
     (high = +80.0°C, hyst = +80.0°C)
     (crit = +80.0°C, hyst = +80.0°C)

jc42-i2c-1-19
Adapter: SMBus I801 adapter at e000
RAM: +32.0°C (low = +0.0°C)
     (high = +80.0°C, hyst = +80.0°C)
     (crit = +80.0°C, hyst = +80.0°C)

jc42-i2c-1-1a
Adapter: SMBus I801 adapter at e000
RAM: +31.0°C (low = +0.0°C)
     (high = +80.0°C, hyst = +80.0°C)
     (crit = +80.0°C, hyst = +80.0°C)

jc42-i2c-1-18
Adapter: SMBus I801 adapter at e000
RAM: +28.0°C (low = +0.0°C)
     (high = +80.0°C, hyst = +80.0°C)
     (crit = +80.0°C, hyst = +80.0°C)

Cheers
Conrad

Weird, especially since the chips should not generate interrupts in the first place unless that is explicitly enabled (which the driver doesn't do, or at least shouldn't do). My wild guess is that taking the chips out of shutdown mode for some reason enables the interrupt.

Can you send the output of "i2cdump -y -f 1 0x18 w"? Also, do the interrupts stop when you unload the driver?

Thanks,
Guenter

Please forget the question about the unload, as you already answered it.

(In reply to Guenter Roeck from comment #4)
> Weird, especially since the chips should not generate interrupts in the
> first place unless that is explicitly enabled (which the driver doesn't do, or
> at least shouldn't do). My wild guess is that taking the chips out of
> shutdown mode for some reason enables the interrupt.
>
> Can you send the output of "i2cdump -y -f 1 0x18 w"?
Here we go: ╭─root@Galactica ~ ╰─➤ i2cdump -y -f 1 0x18 w 0,8 1,9 2,a 3,b 4,c 5,d 6,e 7,f 00: ef00 0000 0005 0000 0005 c801 1f00 0182 08: 0000 0000 0000 0000 0000 0000 0000 0000 10: 0000 0000 0000 0000 0000 0000 0000 0000 18: 0000 0000 0000 0000 0000 0000 0000 0000 20: 0000 0000 0000 0000 0000 0000 0000 0000 28: 0000 0000 0000 0000 0000 0000 0000 0000 30: 0000 0000 0000 0000 0000 0000 0000 0000 38: 0000 0000 0000 0000 0000 0000 0000 0000 40: 0000 0000 0000 0000 0000 0000 0000 0000 48: 0000 0000 0000 0000 0000 0000 0000 0000 50: 0000 0000 0000 0000 0000 0000 0000 0000 58: 0000 0000 0000 0000 0000 0000 0000 0000 60: 0000 0000 0000 0000 0000 0000 0000 0000 68: 0000 0000 0000 0000 0000 0000 0000 0000 70: 0000 0000 0000 0000 0000 0000 0000 0000 78: 0000 0000 0000 0000 0000 0000 0000 0000 80: 0000 0000 0000 0000 0000 0000 0000 0000 88: 0000 0000 0000 0000 0000 0000 0000 0000 90: 0000 0000 0000 0000 0000 0000 0000 0000 98: 0000 0000 0000 0000 0000 0000 0000 0000 a0: 0000 0000 0000 0000 0000 0000 0000 0000 a8: 0000 0000 0000 0000 0000 0000 0000 0000 b0: 0000 0000 0000 0000 0000 0000 0000 0000 b8: 0000 0000 0000 0000 0000 0000 0000 0000 c0: 0000 0000 0000 0000 0000 0000 0000 0000 c8: 0000 0000 0000 0000 0000 0000 0000 0000 d0: 0000 0000 0000 0000 0000 0000 0000 0000 d8: 0000 0000 0000 0000 0000 0000 0000 0000 e0: 0000 0000 0000 0000 0000 0000 0000 0000 e8: 0000 0000 0000 0000 0000 0000 0000 0000 f0: 0000 0000 0000 0000 0000 0000 0000 0000 f8: 0000 0000 0000 0000 0000 0000 0000 0000 >Also, do the interrupts stop when you unload the driver ? No, they stop first, when I do a complete server reboot. Ah, forgot to add. Loading the old "eeprom"-module causes the same problem with the interrupts, see [1]. Maybe this is somehow connected? [1] https://bugzilla.kernel.org/show_bug.cgi?id=177291 This is an Atmel AT30TS00. Per configuration register, events are disabled, and there is no event pending, meaning it should not really be the JC42s generating the interrupts. 
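Guenter's reading of the dump can be reproduced by hand. The following is a minimal sketch, assuming the usual JC-42.4 register layout (0x01 configuration, 0x02 upper limit, 0x03 lower limit, 0x04 critical limit, 0x05 temperature, 0x06 manufacturer ID, 0x07 device ID) and assuming that i2cdump in word mode prints each 16-bit register with its two bytes swapped relative to the chip's big-endian order. The helper names `swap16` and `temp_celsius` are made up for illustration.

```python
def swap16(word):
    """Swap the two bytes of a 16-bit word."""
    return ((word & 0xFF) << 8) | (word >> 8)


def temp_celsius(reg):
    """Decode a temperature or limit register.

    Assumption: the low 13 bits are a two's complement value with
    1/16 degree C per LSB; the top 3 bits are flags and are ignored.
    """
    value = reg & 0x1FFF
    if value & 0x1000:
        value -= 0x2000
    return value / 16.0


# First row of the dump above (registers 0x00..0x07), exactly as printed.
raw = [0xEF00, 0x0000, 0x0005, 0x0000, 0x0005, 0xC801, 0x1F00, 0x0182]
regs = [swap16(w) for w in raw]

config = regs[1]
upper, lower, crit, temp = (temp_celsius(regs[i]) for i in (2, 3, 4, 5))

print(f"config=0x{config:04x} manufacturer=0x{regs[6]:04x} device=0x{regs[7]:04x}")
print(f"temp={temp} upper={upper} lower={lower} crit={crit}")
```

Under these assumptions the configuration word is zero, which is consistent with the statement that events are disabled and none is pending, and the decoded 80 °C limits match the values Conrad set earlier.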
Another question: If you only load the i801 module after boot (i.e. prevent the jc42 module from loading, e.g. by blacklisting it, but still load the i801 module), do you still get the interrupts?

Thanks,
Guenter

(In reply to Guenter Roeck from comment #8)
> Another question: If you only load the i801 module after boot (i.e. prevent
> the jc42 module from loading, e.g. by blacklisting it, but still load the i801
> module), do you still get the interrupts?

That's my current situation ;-) jc42 is only a module, which is currently not loaded at system startup, and i801 is compiled into my kernel. In that case, zero interrupts are generated on i801_smbus.

Cheers
Conrad

#7 suggests a problem with the i801 driver and its interrupt handling. #9 contradicts that a bit, though.

Maybe the C2000 has problems with interrupts, or implements them differently than the driver expects. This may be triggered by an actual access on the bus. You could try to confirm it by running the i2cdump command after booting without the jc42 module loaded (i2cdetect -y 1 should show no reserved addresses) and see if the interrupts start happening.

Thanks,
Guenter

(In reply to Guenter Roeck from comment #10)
> #7 suggests a problem with the i801 driver and its interrupt handling. #9
> contradicts that a bit, though.
>
> Maybe the C2000 has problems with interrupts, or implements them differently
> than the driver expects. This may be triggered by an actual access on the
> bus. You could try to confirm it by running the i2cdump command after
> booting without the jc42 module loaded (i2cdetect -y 1 should show no
> reserved addresses) and see if the interrupts start happening.
>
> Thanks,
> Guenter

You nailed it ;-) Right after executing "i2cdump -y -f 1 0x18 w", the interrupts start massively, even though jc42 wasn't loaded.

Cheers
Conrad

Sorry, but I don't know what you mean here by "reserved".
Before/after executing i2cdump (output is the same):

╭─root@Galactica ~
╰─➤ i2cdetect -y 1
     0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f
00:          -- -- -- -- -- 08 -- -- -- -- -- -- --
10: -- -- -- -- -- -- -- -- 18 19 1a 1b -- -- -- --
20: -- -- -- -- -- -- -- -- -- -- -- -- -- -- 2e --
30: 30 31 32 33 -- -- -- -- -- -- -- -- -- -- -- --
40: -- -- -- -- -- -- -- -- -- 49 -- -- -- -- -- --
50: 50 51 52 53 -- -- -- -- -- -- -- -- -- -- -- --
60: -- 61 -- -- -- -- -- -- -- 69 -- -- 6c -- -- --
70: -- -- -- -- -- -- -- --

A simple "i2cdetect -y 1" also triggers the interrupts.

With "reserved" I meant "a driver for a chip is loaded". After you load the jc42 driver (or the eeprom driver), you'll see that some of the addresses show up as "UU".

Anyway, I think the conclusion is that the i801 driver has problems with interrupt support on your hardware, as I suspected in #10. Issue #177291 is really the same problem. Jean maintains that driver as well, so he should be able to help.

(In reply to Guenter Roeck from comment #13)
> With "reserved" I meant "a driver for a chip is loaded". After you load the
> jc42 driver (or the eeprom driver), you'll see that some of the addresses
> show up as "UU".

Ah, I see. Yes, after loading jc42, I can see "UU".

╭─root@Galactica ~
╰─➤ i2cdetect -y 1
     0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f
00:          -- -- -- -- -- 08 -- -- -- -- -- -- --
10: -- -- -- -- -- -- -- -- UU UU UU UU -- -- -- --
20: -- -- -- -- -- -- -- -- -- -- -- -- -- -- 2e --
30: 30 31 32 33 -- -- -- -- -- -- -- -- -- -- -- --
40: -- -- -- -- -- -- -- -- -- 49 -- -- -- -- -- --
50: 50 51 52 53 -- -- -- -- -- -- -- -- -- -- -- --
60: -- 61 -- -- -- -- -- -- -- 69 -- -- 6c -- -- --
70: -- -- -- -- -- -- -- --

> Anyway, I think the conclusion is that the i801 driver has problems with
> interrupt support on your hardware, as I suspected in #10. Issue #177291 is
> really the same problem. Jean maintains that driver as well, so he should be
> able to help.

Should I close #177291 as a duplicate, since it's my ticket?
Thanks for your support. Hope, Jean has an idea :) Thanks Guenter for stepping in. I always suspected the problem was with the SMBus controller (i2c-i801 driver) and I intended to comment about it long ago but then forgot, sorry about that :-( Conrad, I need detailed information about the SMBus PCI devices and the IRQs on your machine. Please attach the output of: $ /sbin/lspci -nn | grep SMBus $ /sbin/lspci -xxx -s <device> (for each device listed above) $ cat /proc/interrupts Also look for any message related to i2c, SMBus, i801 or the PCI devices above in the kernel logs. Hello Jean! (In reply to Jean Delvare from comment #16) > $ /sbin/lspci -nn | grep SMBus 00:13.0 System peripheral [0880]: Intel Corporation Atom processor C2000 SMBus 2.0 [8086:1f15] (rev 02) 00:1f.3 SMBus [0c05]: Intel Corporation Atom processor C2000 PCU SMBus [8086:1f3c] (rev 02) > $ /sbin/lspci -xxx -s <device> > (for each device listed abov ╭─root@Galactica /home/kostecki ╰─➤ lspci -xxx -s 00:13.0 00:13.0 System peripheral: Intel Corporation Atom processor C2000 SMBus 2.0 (rev 02) 00: 86 80 15 1f 46 05 10 00 02 00 80 08 00 00 00 00 10: 04 40 f1 ff 0f 00 00 00 00 00 00 00 00 00 00 00 20: 00 00 00 00 00 00 00 00 00 00 00 00 d9 15 20 08 30: 00 00 00 00 40 00 00 00 00 00 00 00 ff 01 00 00 40: 10 80 92 00 01 80 00 10 20 08 04 00 00 00 00 00 50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 60: 00 00 00 00 10 00 00 00 00 00 00 00 00 00 00 00 70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 80: 01 8c 03 00 00 00 00 00 00 00 00 00 05 00 81 01 90: 0c f0 ef fe 00 00 00 00 a6 41 00 00 00 00 00 00 a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 e0: 00 00 00 00 00 00 00 00 00 00 01 00 10 00 10 80 f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ╭─root@Galactica /home/kostecki ╰─➤ lspci -xxx -s 00:1f.3 00:1f.3 SMBus: Intel Corporation 
Atom processor C2000 PCU SMBus (rev 02) 00: 86 80 3c 1f 43 01 98 02 02 00 05 0c 00 00 00 00 10: 00 00 50 df 00 00 00 00 00 00 00 00 00 00 00 00 20: 01 e0 00 00 00 00 00 00 00 00 00 00 d9 15 20 08 30: 00 00 00 00 00 00 00 00 00 00 00 00 ff 02 00 00 40: 11 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 60: 03 04 04 00 00 00 08 08 00 00 00 00 00 00 00 00 70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 80: 04 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 f0: 00 00 00 00 00 00 00 00 00 0f 02 01 03 03 03 00 > $ cat /proc/interrupts See attachment. > Also look for any message related to i2c, SMBus, i801 or the PCI devices > above in the kernel logs. ╭─root@Galactica / ╰─➤ dmesg|grep -i smbus [ 7.968653] i801_smbus 0000:00:1f.3: enabling device (0140 -> 0143) [ 7.970338] i801_smbus 0000:00:1f.3: SMBus using PCI interrupt [ 7.974068] ismt_smbus 0000:00:13.0: enabling device (0140 -> 0142) [ 974.471917] ismt_smbus 0000:00:13.0: completion wait timed out [ 975.512022] ismt_smbus 0000:00:13.0: completion wait timed out [ 976.552097] ismt_smbus 0000:00:13.0: completion wait timed out [ 977.592124] ismt_smbus 0000:00:13.0: completion wait timed out [ 978.632168] ismt_smbus 0000:00:13.0: completion wait timed out [ 979.682207] ismt_smbus 0000:00:13.0: completion wait timed out [ 980.712251] ismt_smbus 0000:00:13.0: completion wait timed out [ 981.752310] ismt_smbus 0000:00:13.0: completion wait timed out The timeout messages are only shown, when I do load jc42. I am also attaching my whole dmesg. Cheers Conrad Created attachment 246221 [details]
cat /proc/interrupts
Created attachment 246231 [details]
dmesg output
Can you blacklist ismt-msi, reboot and see if it makes any difference?

(In reply to Jean Delvare from comment #20)
> Can you blacklist ismt-msi, reboot and see if it makes any difference?

No, it didn't change anything. I've compiled a new kernel without ismt-msi (CONFIG_I2C_ISMT=n), and the interrupt count still goes very high after loading jc42.

OK, thanks. I have added Intel folks to Cc. I can't find the register descriptions for the Atom C2000 SMBus function, so there's not much I can do.

Conrad, support for the SMBus in this CPU family was added to the i2c-i801 driver several years ago, so I am wondering why this bug is only reported now. Is this new hardware for you? Or have you had it for some time, and it was working fine so far, and broke with a kernel or OS update?

I found a datasheet through the Avoton C2750 page:
http://ark.intel.com/products/77987/Intel-Atom-Processor-C2750-4M-Cache-2_40-GHz
-> https://www-ssl.intel.com/content/dam/www/public/us/en/documents/datasheets/atom-c2000-microserver-datasheet.pdf

I guess both C2758 and C2750 are compatible as they are listed in C2000 Product Family for Communications.

(In reply to Jean Delvare from comment #22)
> Is this new hardware for you? Or have you had it for some time, and it was
> working fine so far, and broke with a kernel or OS update?

Yes, this is new hardware. I bought it a few weeks before opening this ticket, so I can't tell whether it was working before.

(In reply to Jarkko Nikula from comment #23)
> I found a datasheet through the Avoton C2750 page:
> http://ark.intel.com/products/77987/Intel-Atom-Processor-C2750-4M-Cache-2_40-GHz
> ->
> https://www-ssl.intel.com/content/dam/www/public/us/en/documents/datasheets/atom-c2000-microserver-datasheet.pdf
>
> I guess both C2758 and C2750 are compatible as they are listed in C2000
> Product Family for Communications.

The C2750 has Turbo Boost; the C2758 has a QuickAssist accelerator instead of Turbo Boost.
(I don't know if this makes a difference for the registers.)

Jarkko, I found the same document, however it doesn't appear to contain register definitions, or I am blind.

(In reply to Jean Delvare from comment #25)
> Jarkko, I found the same document, however it doesn't appear to contain
> register definitions, or I am blind.

Maybe chapters 15.8 and 18.5? Sorry if that's wrong, as I don't know whether that's what you are looking for.

The problem is that only the register addresses are provided, not the register definitions. Sure, there is a status register, and we know its address, but we don't know how the bits are defined and whether they are defined exactly like in other Intel CPUs. With the C2000 being a different micro-architecture than the "mainline" Intel CPUs, there is a real possibility that the register definitions are different.

Sorry, I looked at it too quickly. Indeed, the definitions are missing. I'll ask http://ark.intel.com/ whether a more detailed datasheet is available.

Conrad, until we sort it out, you may be able to work around the problem by passing option disable_features=0x10 to the i2c-i801 driver.

(In reply to Jean Delvare from comment #29)
> Conrad, until we sort it out, you may be able to work around the problem by
> passing option disable_features=0x10 to the i2c-i801 driver.

Hey Jean, disabling the interrupts for i2c-i801 seems to help as a workaround.

[ 7.950079] i801_smbus 0000:00:1f.3: Interrupt disabled by user
[ 7.951624] i801_smbus 0000:00:1f.3: enabling device (0140 -> 0143)
[ 7.953270] i801_smbus 0000:00:1f.3: SMBus using polling

Cheers
Conrad

*** Bug 177291 has been marked as a duplicate of this bug. ***

Any news for me? :)

Jarkko, were you able to get your hands on a datasheet? It doesn't need to be public, if you can check the register definitions for us.

I got one contact info back in December but no response. Maybe busy before holidays and I forgot to ping again. I'll ask again.
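For reference, the disable_features value used in this workaround is a bit mask. The sketch below decodes it. The bit meanings are quoted from memory of the i2c-i801 module parameter description, so treat the table as an assumption and check it against the driver source of your kernel; `disabled_features` is a made-up helper name.

```python
# Assumed meaning of the i2c-i801 "disable_features" bits
# (from memory of the module parameter description; verify locally).
FEATURES = {
    0x01: "SMBus PEC",
    0x02: "block buffer",
    0x08: "I2C block read",
    0x10: "interrupts",
    0x20: "SMBus Host Notify",
}


def disabled_features(mask):
    """Return the names of the features that a given mask would disable."""
    return [name for bit, name in sorted(FEATURES.items()) if mask & bit]


print(disabled_features(0x10))
```

With the table above, 0x10 maps to interrupts only, which agrees with the "Interrupt disabled by user" and "SMBus using polling" messages in the log. When the driver is built as a module, the option would typically be set with a line such as `options i2c_i801 disable_features=0x10` in a modprobe configuration file; when it is built into the kernel, as in Conrad's setup, the equivalent is `i2c_i801.disable_features=0x10` on the kernel command line.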
(In reply to Jarkko Nikula from comment #34)
> I got one contact info back in December but no response. Maybe busy before
> holidays and I forgot to ping again. I'll ask again.

Did you get any reply?

Only an out-of-office reply back in March, but I pinged again now.

(In reply to Jarkko Nikula from comment #36)
> Only an out-of-office reply back in March, but I pinged again now.

And now? ;-)

Hmm... Seems this one gets somehow abandoned. Jarkko, any news on this? Same question to Conrad, do you have any luck with v5.11 based kernels (or closer to latest)?

(In reply to Andy Shevchenko from comment #38)
> Hmm... Seems this one gets somehow abandoned. Jarkko, any news on this? Same
> question to Conrad, do you have any luck with v5.11 based kernels (or closer
> to latest)?

Nope. No news. The problem still exists with the latest kernel.

Unfortunately I don't have any updates on this.

This bug gave me the idea to try MSI on i801, but it appears that none of the platforms have MSI capability on this device.

Not sure if it's useful information, but I think it's better to share it anyway. Not sure it's completely related, but I would assume at least partially.

I have two mini-servers, one with a Supermicro A2SDi-8C-HLN4F (Atom C3758), and the other one with an older Supermicro A1SRM-2758F (Atom C2758F).

I upgraded both from Debian Buster (kernel 4.19.194-3) to Bullseye (5.10.46-5). No issue on the C3758, but i was faced with severe performance regression on the C2758F.

When running 5.10 on the C2758F, /proc/interrupts shows about 100k interrupts per second for 'IO-APIC 18-fasteoi i801_smbus', and overall performance suffers a lot (e.g. iperf between two KVM virtual machines bridged together is 93% slower with 5.10 than with 4.19).

So far i was getting around the issue by blocklisting i2c_i801. After i found this bug, i tried adding the disable_features=0x10 option, and that worked too.
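An interrupt rate like the one quoted above can be measured by taking two snapshots of /proc/interrupts and dividing the counter difference by the elapsed time. This is a minimal sketch: the first sample line is the i801_smbus line posted elsewhere in this thread, while the second sample and the 5-second interval are invented purely for illustration, and `irq_total` is a hypothetical helper.

```python
def irq_total(proc_interrupts, name):
    """Sum the per-CPU counters of the line whose last field equals `name`.

    Assumes the IRQ is not shared, i.e. `name` is the final field on the line.
    """
    for line in proc_interrupts.splitlines():
        fields = line.split()
        if fields and fields[-1] == name:
            total = 0
            # fields[0] is the IRQ number ("18:"); counters follow until
            # the first non-numeric field (the interrupt controller name).
            for field in fields[1:]:
                if not field.isdigit():
                    break
                total += int(field)
            return total
    raise KeyError(name)


# Sample 1: line taken from the thread. Sample 2: made-up value for illustration.
before = "18: 0 0 0 26596692 0 0 0 0 IO-APIC 18-fasteoi i801_smbus"
after = "18: 0 0 0 27096692 0 0 0 0 IO-APIC 18-fasteoi i801_smbus"
interval_s = 5.0  # assumed time between the two snapshots

rate = (irq_total(after, "i801_smbus") - irq_total(before, "i801_smbus")) / interval_s
print(f"{rate:.0f} interrupts/s")
```

On a live system the two strings would come from reading /proc/interrupts twice, with a sleep in between.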
I'm not using jc42 at all, sensors thresholds are set to correct values by the distro tools. # i2cdetect -l # sensors nvme-pci-0400 Adapter: PCI adapter Composite: +30.9°C (low = -273.1°C, high = +84.8°C) (crit = +84.8°C) Sensor 1: +30.9°C (low = -273.1°C, high = +65261.8°C) Sensor 2: +31.9°C (low = -273.1°C, high = +65261.8°C) coretemp-isa-0000 Adapter: ISA adapter Core 0: +48.0°C (high = +98.0°C, crit = +98.0°C) Core 1: +48.0°C (high = +98.0°C, crit = +98.0°C) Core 2: +48.0°C (high = +98.0°C, crit = +98.0°C) Core 3: +48.0°C (high = +98.0°C, crit = +98.0°C) Core 4: +47.0°C (high = +98.0°C, crit = +98.0°C) Core 5: +46.0°C (high = +98.0°C, crit = +98.0°C) Core 6: +47.0°C (high = +98.0°C, crit = +98.0°C) Core 7: +47.0°C (high = +98.0°C, crit = +98.0°C) # dmesg | egrep -i '(smbus|i801)' [ 2.226240] ismt_smbus 0000:00:13.0: enabling device (0000 -> 0002) [ 2.229927] i801_smbus 0000:00:1f.3: enabling device (0000 -> 0003) [ 2.230089] i801_smbus 0000:00:1f.3: SPD Write Disable is set [ 2.230136] i801_smbus 0000:00:1f.3: SMBus using PCI interrupt ~# lspci -nn | grep SMBus 00:13.0 System peripheral [0880]: Intel Corporation Atom processor C2000 SMBus 2.0 [8086:1f15] (rev 03) 00:1f.3 SMBus [0c05]: Intel Corporation Atom processor C2000 PCU SMBus [8086:1f3c] (rev 03) # lspci -xxx -s 00:13.0 00:13.0 System peripheral: Intel Corporation Atom processor C2000 SMBus 2.0 (rev 03) 00: 86 80 15 1f 06 04 10 00 03 00 80 08 00 00 00 00 10: 04 70 31 df 00 00 00 00 00 00 00 00 00 00 00 00 20: 00 00 00 00 00 00 00 00 00 00 00 00 d9 15 20 08 30: 00 00 00 00 40 00 00 00 00 00 00 00 ff 01 00 00 40: 10 80 92 00 01 80 00 10 20 08 04 00 00 00 00 00 50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 60: 00 00 00 00 10 00 00 00 00 00 00 00 00 00 00 00 70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 80: 01 8c 03 00 00 00 00 00 00 00 00 00 05 00 81 01 90: 04 00 e4 fe 00 00 00 00 21 40 00 00 00 00 00 00 a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 b0: 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 e0: 00 00 00 00 00 00 00 00 00 00 01 00 10 00 10 80 f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 # lspci -xxx -s 00:1f.3 00:1f.3 SMBus: Intel Corporation Atom processor C2000 PCU SMBus (rev 03) 00: 86 80 3c 1f 03 00 98 02 03 00 05 0c 00 00 00 00 10: 00 40 31 df 00 00 00 00 00 00 00 00 00 00 00 00 20: 01 e0 00 00 00 00 00 00 00 00 00 00 d9 15 20 08 30: 00 00 00 00 00 00 00 00 00 00 00 00 ff 02 00 00 40: 11 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 60: 03 04 04 00 00 00 08 08 00 00 00 00 00 00 00 00 70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 80: 04 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 f0: 00 00 00 00 00 00 00 00 00 0f 02 01 03 03 03 00 Yes, this is the same problem here. But Intel doesn't seem to be interessted here :-( (In reply to stephane.poignant from comment #42) > I upgraded both from Debian Buster (kernel 4.19.194-3) to Bullseye > (5.10.46-5). No issue on the C3758, but i was faced with severe performance > regression on the C2758F. > Interesting, so was the 4.19 working on the C2758F without interrupt storm? (In reply to Jarkko Nikula from comment #44) > (In reply to stephane.poignant from comment #42) > > I upgraded both from Debian Buster (kernel 4.19.194-3) to Bullseye > > (5.10.46-5). No issue on the C3758, but i was faced with severe performance > > regression on the C2758F. > > > Interesting, so was the 4.19 working on the C2758F without interrupt storm? 
I haven't checked /proc/interrupts when running 4.19, so i cannot tell for sure that the interrupts were not there. The performance regression was not there for sure. I can check this in a couple of weeks (the server is at a remote location with no OOBM network).

Dmesg when running 4.19 shows it had interrupts enabled:

[ 0.000000] Linux version 4.19.0-17-amd64 (debian-kernel@lists.debian.org) (gcc version 8.3.0 (Debian 8.3.0-6)) #1 SMP Debian 4.19.194-3 (2021-07-18)
[ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-4.19.0-17-amd64 root=/dev/mapper/vg1--hrbpsrv01-h--hrbpsrv01 ro quiet rd.luks.options=discard
...
[ 1.434097] Run /init as init process
[ 1.782787] dca service started, version 1.12.1
[ 1.783203] ismt_smbus 0000:00:13.0: enabling device (0000 -> 0002)
[ 1.796694] cryptd: max_cpu_qlen set to 1000
[ 1.801177] i801_smbus 0000:00:1f.3: enabling device (0000 -> 0003)
[ 1.801317] i801_smbus 0000:00:1f.3: SPD Write Disable is set
[ 1.801356] i801_smbus 0000:00:1f.3: SMBus using PCI interrupt
[ 1.805199] igb: Intel(R) Gigabit Ethernet Network Driver - version 5.4.0-k
[ 1.805202] igb: Copyright (c) 2007-2014 Intel Corporation.
[ 1.805246] igb 0000:00:14.0: enabling device (0000 -> 0002)
[ 1.816722] SSE version of gcm_enc/dec engaged.
...

The problem does persist in kernel 4.19 and other versions. It only depends on whether another driver triggers the interrupts. If so, the count climbs very high. So it's possible that you had no driver using those interrupts in 4.19 and, as a consequence, the bug did not trigger.

@Jarkko Nikula: Since you are still replying, could you please try again to get the needed docs, as requested by Jean Delvare?

@Conrad Kostecki: Yeah, I agree with you, it's unlikely the problem was absent in 4.19 as it was present way before it.

I was in contact with our sales support and they told me the Atom C2758 with the F suffix is custom to SuperMicro.
Unfortunately they didn't find an explicit specification for the SMBus controller on it, but they told me it's based on the same 22 nm Silvermont architecture as Bay Trail, so I suppose the SMBus IO should be compatible. Unfortunately, public datasheets for Bay Trail seem scarce too, but I was able to find something when searching for datasheets for the Bay Trail E3825 used in the MinnowBoard Max. The following document seems to be available to registered ark.intel.com users or through search engines: "Intel Atom ® Processor E3800 Product Family", Document Number 538136, Chapter 33 "PCU – System Management Bus (SMBus)".

Created attachment 299193 [details]
Debug patch for the i2c-i801 interrupts
Could you try the attached patch to see which interrupt statuses it prints during the interrupt storm? The debug print is rate limited, so it shouldn't flood the dmesg. You need to have CONFIG_DYNAMIC_DEBUG=y in your kernel config and either enable the debug print at runtime as follows:

mount none /sys/kernel/debug -t debugfs
echo -n "func i801_isr +p" >/sys/kernel/debug/dynamic_debug/control

or append this to your kernel command line:

i2c_i801.dyndbg="func i801_isr +p"

Here is the output:

pcicst 0x298, SMBHSTSTS 0x60
[ 359.205884] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 359.205918] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 364.210031] i801_isr: 375367 callbacks suppressed
[ 364.210043] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 364.210085] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 364.210126] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 364.210142] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 364.210178] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 364.210217] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 364.210234] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 364.210253] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 364.210292] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 364.210329] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 369.220035] i801_isr: 380909 callbacks suppressed
[ 369.220047] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 369.220069] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 369.220109] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 369.220146] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 369.220185] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 369.220222] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 369.220262] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 369.220278] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 369.220317]
i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60 [ 369.220333] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60 [ 374.230078] i801_isr: 393736 callbacks suppressed [ 374.230109] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60 [ 374.230151] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60 [ 374.230191] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60 [ 374.230210] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60 [ 374.230248] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60 [ 374.230283] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60 [ 374.230297] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60 [ 374.230332] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60 [ 374.230345] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60 [ 374.230358] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60 [ 379.240037] i801_isr: 382705 callbacks suppressed [ 379.240068] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60 [ 379.240090] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60 [ 379.240110] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60 [ 379.240130] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60 [ 379.240150] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60 [ 379.240186] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60 [ 379.240205] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60 [ 379.240242] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60 [ 379.240281] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60 [ 379.240297] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60 [ 384.250032] i801_isr: 387109 callbacks suppressed [ 384.250043] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60 [ 384.250065] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60 [ 384.250104] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60 [ 384.250141] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60 [ 384.250181] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60 [ 384.250197] 
i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60 [ 384.250216] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60 [ 384.250255] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60 [ 384.250292] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60 [ 384.250311] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60 $ cat /proc/interrupts CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7 18: 0 0 0 26596692 0 0 0 0 IO-APIC 18-fasteoi i801_smbus Thanks. Those debug prints confirm the interrupt is really coming from the SMBus controller (bit 3 is set in PCI status) and the SMB alert bit is set. Created attachment 299201 [details]
Experimental patch disabling SMB_ALERT signal
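The two values in the debug output, `pcicst 0x298` and `SMBHSTSTS 0x60`, can be decoded by hand. This is a small sketch, assuming the standard PCI status register layout and the host status bit names used in the i2c-i801 driver (both quoted from memory, so verify them against the PCI specification and the driver source); `decode` is a made-up helper.

```python
# Assumed subset of the PCI status register bits.
PCI_STATUS_BITS = {
    0x0008: "interrupt status",
    0x0010: "capabilities list",
    0x0020: "66 MHz capable",
    0x0080: "fast back-to-back capable",
}

# Assumed SMBus host status (SMBHSTSTS) bit names.
SMBHSTSTS_BITS = {
    0x01: "HOST_BUSY",
    0x02: "INTR",
    0x04: "DEV_ERR",
    0x08: "BUS_ERR",
    0x10: "FAILED",
    0x20: "SMBALERT_STS",
    0x40: "INUSE_STS",
    0x80: "BYTE_DONE",
}


def decode(value, table):
    """Return the names of all bits from `table` that are set in `value`."""
    return [name for bit, name in sorted(table.items()) if value & bit]


pci_flags = decode(0x298, PCI_STATUS_BITS)
devsel_timing = (0x298 >> 9) & 0x3  # DEVSEL timing field, bits 10:9
host_flags = decode(0x60, SMBHSTSTS_BITS)

print(pci_flags, devsel_timing)
print(host_flags)
```

Under these assumptions, bit 3 of the PCI status (interrupt pending) and the SMBALERT_STS bit of the host status are both set, which is what the analysis of the debug prints points at.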
@Conrad Kostecki: Could you check whether the attached experimental patch, which disables SMB_ALERT, helps here?

Thanks for the follow up, i will test the patch on my setup as well by next week.

I just tested the patch and can confirm that it works. After applying the patch, interrupts dropped to nearly zero on i801_smbus.

(In reply to Conrad Kostecki from comment #55)
> I just tested the patch and can confirm that it works. After applying the
> patch, interrupts dropped to nearly zero on i801_smbus.

According to the specification, the host (if it implements ALERT) should issue a special byte read command to see which device wants to send something. If a proper implementation doesn't fix this, it might be a pin configuration issue (like a pull-down sitting on the respective pin) or a PCB or firmware (BIOS) issue.

It would be nice to understand, if it can be done without much effort, what exactly is causing ALERT to be asserted.

I was also wondering whether SMB_ALERT should be properly acknowledged, but since the driver currently doesn't support that, I wanted to see whether simply disabling it helps.

Fortunately I was able to reproduce the issue locally on another platform where SMB_ALERT was connected to a resistor, so I could pull the signal down with a wire. The interrupt storm begins when SMB_ALERT goes low for a moment and continues afterwards.

I'll test a bit more and make a proper patch. One thing I'm wondering is whether the driver should restore the original disable status on driver removal, like what is done for host notify in i801_disable_host_notify().

Created attachment 299217 [details]
2nd version of patch disabling SMB_ALERT signal
I moved the SMB_ALERT signal disabling into i801_enable_host_notify(), since the SMBSLVCMD register is available on ICH3 and later. It also preserves the original register value from before the driver was loaded.
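
For anyone decoding the debug output quoted earlier in this thread (pcicst 0x298, SMBHSTSTS 0x60), here is a rough sketch of how the two values break down. The bit names are an assumption: they follow the SMBHSTSTS_* constants as I remember them from the i2c-i801 driver and the standard PCI status register layout, so check them against the datasheet for your chipset. The helper names decode_hststs() and interrupt_pending() are made up for this example.

```python
# Assumed bit layout, modelled on the SMBHSTSTS_* constants in the Linux
# i2c-i801 driver. Verify against the chipset datasheet before relying on it.
PCI_STATUS_INTERRUPT = 0x08  # bit 3 of the PCI status register

SMBHSTSTS_BITS = {
    0x01: "HOST_BUSY",
    0x02: "INTR",
    0x04: "DEV_ERR",
    0x08: "BUS_ERR",
    0x10: "FAILED",
    0x20: "SMBALERT_STS",
    0x40: "INUSE_STS",
    0x80: "BYTE_DONE",
}


def decode_hststs(value):
    """Return the names of the bits set in an SMBHSTSTS value, lowest bit first."""
    return [name for mask, name in sorted(SMBHSTSTS_BITS.items()) if value & mask]


def interrupt_pending(pcists):
    """True if the PCI status word reports a pending interrupt (bit 3)."""
    return bool(pcists & PCI_STATUS_INTERRUPT)


if __name__ == "__main__":
    print(interrupt_pending(0x298))
    print(decode_hststs(0x60))
```

With the values from the debug prints, this reports a pending interrupt and the SMBALERT_STS and INUSE_STS bits, which matches the reading that the SMB alert status is what keeps firing.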
(In reply to Jarkko Nikula from comment #58)
> 2nd version of patch disabling SMB_ALERT signal

Side remark: Looking at this code, shouldn't you first clear the current notifications and only then enable the IRQ?

Patch v2 works for me. Interrupts are still fine and do not go crazy.

I can confirm that I am getting the same results with the two patches on my setup with the Debian kernels. The debug patch produces the same messages, and with the SMB_ALERT disable patch no interrupts are triggered any more.

Also, when booting into the previous kernel I was using (linux-image-4.19.0-17-amd64 4.19.194-3), the module loads with the default config but I am not getting any interrupts. So for my particular setup, the issue only appeared after upgrading from Debian kernel 4.19 to 5.10.

I will test the second version of the patch ASAP and provide you with the results.

## Kernel 4.19

# uname -a
Linux hrbpsrv01.intra.lan 4.19.0-17-amd64 #1 SMP Debian 4.19.194-3 (2021-07-18) x86_64 GNU/Linux
# cat /proc/interrupts | grep i801
18: 0 0 0 0 0 0 0 0 IO-APIC 18-fasteoi i801_smbus
# dmesg
...
[ 6652.023634] i801_smbus 0000:00:1f.3: SPD Write Disable is set
[ 6652.023689] i801_smbus 0000:00:1f.3: SMBus using PCI interrupt
...

## Debian linux-image-5.10.0-9-amd64 (5.10.70-1) + Debug patch

# uname -a
Linux hrbpsrv01.intra.lan 5.10.0-9-amd64 #1 SMP Debian 5.10.70-1 (2021-09-30) x86_64 GNU/Linux
# cat /proc/interrupts | grep i801
18: 0 0 0 0 0 7358862 0 0 IO-APIC 18-fasteoi i801_smbus
(increasing by about 100k interrupts/sec)
# dmesg
...
[ 516.429120] i801_smbus 0000:00:1f.3: SPD Write Disable is set
[ 516.429140] i801_smbus 0000:00:1f.3: An interrupt is pending!
[ 516.429161] i801_smbus 0000:00:1f.3: SMBus using PCI interrupt
[ 516.429933] i2c i2c-1: 4/4 memory slots populated (from DMI)
[ 516.430337] at24 1-0050: supply vcc not found, using dummy regulator
[ 516.431043] at24 1-0050: 256 byte spd EEPROM, read-only
[ 516.431078] i2c i2c-1: Successfully instantiated SPD at 0x50
[ 516.431455] at24 1-0051: supply vcc not found, using dummy regulator
[ 516.432148] at24 1-0051: 256 byte spd EEPROM, read-only
[ 516.432174] i2c i2c-1: Successfully instantiated SPD at 0x51
[ 516.432576] at24 1-0052: supply vcc not found, using dummy regulator
[ 516.433284] at24 1-0052: 256 byte spd EEPROM, read-only
[ 516.433325] i2c i2c-1: Successfully instantiated SPD at 0x52
[ 516.433748] at24 1-0053: supply vcc not found, using dummy regulator
[ 516.434454] at24 1-0053: 256 byte spd EEPROM, read-only
[ 516.434497] i2c i2c-1: Successfully instantiated SPD at 0x53
[ 525.513104] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 525.513133] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 525.513161] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 525.513185] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 525.513209] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 525.513234] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 525.513258] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 525.513281] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 525.513316] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 525.513352] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 530.514207] i801_isr: 297603 callbacks suppressed
[ 530.514221] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 530.514259] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 530.514299] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 530.514331] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 530.514366] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 530.514391] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 530.514425] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 530.514457] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 530.514482] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 530.514507] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 535.518261] i801_isr: 320308 callbacks suppressed
[ 535.518273] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 535.518311] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 535.518337] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 535.518362] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 535.518386] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 535.518415] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 535.518442] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 535.518467] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 535.518491] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
[ 535.518516] i801_smbus 0000:00:1f.3: pcicst 0x298, SMBHSTSTS 0x60
...

## Kernel 5.10 + Disable ALRM interrupt patch

# cat /proc/interrupts | grep i801
18: 0 0 0 0 0 10567596 0 0 IO-APIC 18-fasteoi i801_smbus
(no longer increasing)
# dmesg
...
[ 664.110013] i801_smbus 0000:00:1f.3: SPD Write Disable is set
[ 664.110065] i801_smbus 0000:00:1f.3: SMBus using PCI interrupt
[ 664.111975] i2c i2c-1: 4/4 memory slots populated (from DMI)
[ 664.112460] at24 1-0050: supply vcc not found, using dummy regulator
[ 664.113195] at24 1-0050: 256 byte spd EEPROM, read-only
[ 664.113240] i2c i2c-1: Successfully instantiated SPD at 0x50
[ 664.113657] at24 1-0051: supply vcc not found, using dummy regulator
[ 664.114374] at24 1-0051: 256 byte spd EEPROM, read-only
[ 664.114412] i2c i2c-1: Successfully instantiated SPD at 0x51
[ 664.114823] at24 1-0052: supply vcc not found, using dummy regulator
[ 664.116794] at24 1-0052: 256 byte spd EEPROM, read-only
[ 664.116838] i2c i2c-1: Successfully instantiated SPD at 0x52
[ 664.117288] at24 1-0053: supply vcc not found, using dummy regulator
[ 664.118042] at24 1-0053: 256 byte spd EEPROM, read-only
[ 664.118092] i2c i2c-1: Successfully instantiated SPD at 0x53

Patch V2 works for me too.

# cat /proc/interrupts | grep i801
18: 0 0 0 0 0 8 0 0 IO-APIC 18-fasteoi i801_smbus

(In reply to Andy Shevchenko from comment #59)
> (In reply to Jarkko Nikula from comment #58)
> > 2nd version of patch disabling SMB_ALERT signal
>
> Side remark: Looking into this code, shouldn't you first clean current
> notifications and only after that enable IRQ?

That's a good question, and it made me debug further. In fact, disabling doesn't disable detection: SMBALERT_STS will still be set and will cause a short burst of interrupts at driver load and unload time if the SMB_ALERT signal was asserted. It looks like it's better to add basic acknowledging for it to i801_isr().

I'm not sure whether clearing pending interrupts at probe time would cause any regression, but acknowledging SMBALERT_STS in i801_isr() makes sure the status doesn't stay set forever if it occurs after probe.
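
To illustrate why acknowledging in the interrupt handler matters, here is a toy model, not the driver code. It assumes a level-triggered interrupt that stays asserted while SMBALERT_STS is set, and a write-1-to-clear status register. ToyController, isr_without_ack(), isr_with_ack() and count_interrupts() are hypothetical names invented for this sketch, and the 0x20 bit position is an assumption taken from the i2c-i801 driver constants.

```python
SMBHSTSTS_SMBALERT_STS = 0x20  # assumed bit position, as in the i2c-i801 driver


class ToyController:
    """Toy model: the IRQ line stays asserted while SMBALERT_STS is set."""

    def __init__(self):
        self.hststs = 0

    def alert(self):
        # SMB_ALERT went low for a moment: the status bit latches.
        self.hststs |= SMBHSTSTS_SMBALERT_STS

    def write_hststs(self, value):
        # Write-1-to-clear: every 1 bit written clears that status bit.
        self.hststs &= ~value & 0xFF

    def irq_asserted(self):
        return bool(self.hststs & SMBHSTSTS_SMBALERT_STS)


def isr_without_ack(ctrl):
    # Handler that knows nothing about SMBALERT_STS and leaves it set.
    pass


def isr_with_ack(ctrl):
    # Handler that acknowledges the alert status when it sees it.
    if ctrl.hststs & SMBHSTSTS_SMBALERT_STS:
        ctrl.write_hststs(SMBHSTSTS_SMBALERT_STS)


def count_interrupts(isr, limit=1000):
    """Deliver interrupts while the line is asserted, giving up at `limit`."""
    ctrl = ToyController()
    ctrl.alert()
    count = 0
    while ctrl.irq_asserted() and count < limit:
        isr(ctrl)
        count += 1
    return count


if __name__ == "__main__":
    print(count_interrupts(isr_without_ack))  # runs until the limit
    print(count_interrupts(isr_with_ack))     # stops after one interrupt
```

In this model the handler without acknowledgement is re-entered until the artificial limit is hit, which is the storm pattern seen in the logs, while the acknowledging handler is called once per alert.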
(In reply to Jarkko Nikula from comment #63)
> (In reply to Andy Shevchenko from comment #59)
> > (In reply to Jarkko Nikula from comment #58)
> > > 2nd version of patch disabling SMB_ALERT signal
> >
> > Side remark: Looking into this code, shouldn't you first clean current
> > notifications and only after that enable IRQ?
>
> That's a good question and made me debugging more. In fact disabling doesn't
> disable detection and SMBALERT_STS will be set and cause short burst of
> interrupts during driver load and unload time if SMB_ALERT signal was
> asserted. Looks like it's better to add basic acknowledging for it into
> i801_isr().
>
> I'm not sure would clearing pending interrupts at the probe time cause any
> regression but acknowledging the SMBALERT_STS in i801_isr() makes sure the
> status doesn't stay forever if it occurs after probe.

It also makes sense to test it with DEBUG_SHIRQ enabled (yes, I know that more than half of the drivers in the Linux kernel will either crash or behave badly with it; not many developers know about this debugging feature).

This bug is believed to be fixed in kernel v5.16 by the following 2 commits:

commit 03a976c9afb5e3c4f8260c6c08a27d723b279c92
Author: Jarkko Nikula
Date: Wed Nov 17 11:45:09 2021 +0200

    i2c: i801: Fix interrupt storm from SMB_ALERT signal

commit 9b5bf5878138293fb5b14a48a7a17b6ede6bea25
Author: Jean Delvare
Date: Tue Nov 9 16:02:57 2021 +0100

    i2c: i801: Restore INTREN on unload

Upgraded to kernel 5.16 today; no more IRQ noise. Thank you!