Most recent kernel where this bug did not occur:
Distribution: Gentoo
Hardware Environment: Opteron 175, VIA K8T800Pro Host Bridge, VIA VT8237 PCI bridge [K8T800/K8T890 South], Promise PDC40718 (SATA 300 TX4) (rev 02), VIA VT6102 [Rhine-II], Intel Corporation 82541PI Gigabit Ethernet Controller, SAMSUNG HD501LJ
Software Environment:
Problem Description:

When the SATA controller is under high load in combination with network I/O, the kernel log shows the following exceptions:

ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2
ata1.00: port_status 0x20080000
ata1.00: cmd 25/00:58:bf:d1:83/00:00:1a:00:00/e0 tag 0 cdb 0x0 data 45056 in
         res 50/00:00:16:d2:83/00:00:1a:00:00/e0 Emask 0x2 (HSM violation)
ata1: soft resetting port
ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata1.00: configured for UDMA/133
ata1: EH complete

This also occurs on all other ata(1-4) ports (4 HD501LJ disks). The message rate is about 4/minute when transferring at maximum speed (disk & network).

This only happens when there is (high) network traffic. Reading more than 300 GB locally does not trigger HSM violation messages. Starting network traffic (ping -f ...) immediately triggers the messages.

The interesting thing is: changing the network interface from e1000 to the onboard VIA Rhine does not change this behaviour.

/proc/interrupts:
           CPU0       CPU1
  0:         86          1   IO-APIC-edge      timer
  1:          0          8   IO-APIC-edge      i8042
  8:          0          2   IO-APIC-edge      rtc
  9:          0          0   IO-APIC-fasteoi   acpi
 14:          2       2676   IO-APIC-edge      ide0
 16:      57697         45   IO-APIC-fasteoi   eth0
 17:          0          0   IO-APIC-fasteoi   eth1
 18:          0        363   IO-APIC-fasteoi   ide2, ide3
 20:          0      93476   IO-APIC-fasteoi   sata_promise
 21:          0          0   IO-APIC-fasteoi   ehci_hcd:usb1, uhci_hcd:usb2, uhci_hcd:usb3, uhci_hcd:usb4, uhci_hcd:usb5
NMI:          0          0
LOC:     537738     537580
ERR:          0
MIS:          0

Tickless and cpufreq are disabled. Disabling SMP does not change the behaviour. The next thing I will try is moving the SATA controller to a different PCI slot.

Steps to reproduce:
1. Generate heavy disk I/O
2. Wait some time to check no messages occur
3. Generate network traffic on some interface
4. HSM violation messages occur
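
For anyone comparing setups: one quick thing to rule out from a /proc/interrupts dump like the one above is a shared interrupt line between the NIC and the SATA controller. The following is only a rough sketch in Python; the function names parse_interrupts and shared_irqs are made up for illustration, SAMPLE is an excerpt of the table from this report, and the parser assumes the simple "IRQ: counts, controller type, device list" layout shown there.

```python
# Excerpt of the /proc/interrupts table from this report.
SAMPLE = """\
           CPU0       CPU1
  0:         86          1   IO-APIC-edge      timer
 16:      57697         45   IO-APIC-fasteoi   eth0
 17:          0          0   IO-APIC-fasteoi   eth1
 18:          0        363   IO-APIC-fasteoi   ide2, ide3
 20:          0      93476   IO-APIC-fasteoi   sata_promise
 21:          0          0   IO-APIC-fasteoi   ehci_hcd:usb1, uhci_hcd:usb2, uhci_hcd:usb3, uhci_hcd:usb4, uhci_hcd:usb5
NMI:          0          0
LOC:     537738     537580
ERR:          0
"""


def parse_interrupts(text):
    """Return {irq_number: (per_cpu_counts, device_names)} for numeric IRQ rows."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    table = {}
    for ln in lines[1:]:
        # Split on the first colon only; device names may contain colons.
        label, _, rest = ln.partition(":")
        if not label.strip().isdigit():
            continue  # skip summary rows such as NMI, LOC, ERR, MIS
        fields = rest.split()
        counts = [int(x) for x in fields[:ncpu]]
        # fields[ncpu] is the controller type, e.g. IO-APIC-fasteoi.
        names = " ".join(fields[ncpu + 1:])
        devices = [d.strip() for d in names.split(",") if d.strip()]
        table[int(label)] = (counts, devices)
    return table


def shared_irqs(table, dev_a, dev_b):
    """Return the sorted IRQ numbers on which both devices are registered."""
    return sorted(irq for irq, (_, devs) in table.items()
                  if dev_a in devs and dev_b in devs)


if __name__ == "__main__":
    table = parse_interrupts(SAMPLE)
    print("eth0 / sata_promise shared IRQs:",
          shared_irqs(table, "eth0", "sata_promise"))
    print("sata_promise per-CPU counts:", table[20][0])
```

On the table from this report, eth0 (IRQ 16) and sata_promise (IRQ 20) do not share a line, so plain IRQ sharing does not look like the explanation here.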
Created attachment 13000 [details] Kernel log
cc'd Mikael Pettersson for sata_promise.
Tested different PCI slots, no change. Disabling PCI posted write/delayed transaction in the BIOS setup did not help either; it only decreased performance.
If you can, please try putting the Promise card + disks and the NICs in another machine with a different (preferably newer/better) chipset. I've seen Promise SATA cards trigger the error you mentioned all by themselves on some machines, while the same card/cable/disk combination works better in other machines. At this point, I strongly suspect chipset/PCI interaction issues, though I don't know what they might be or if they can be worked around in the driver.
I put the cards in an nForce3-based board for testing; so far no messages. By the way, on the VIA-based board all PCI devices were on bus 0 (chipset architecture?), while on the nForce3 board the PCI slots (external PCI bus) are on bus 2.
After approx. 20 hours one message showed up:

ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2
ata1.00: port_status 0x20280000
ata1.00: cmd c8/00:10:27:c9:d1/00:00:00:00:00/e1 tag 0 cdb 0x0 data 8192 in
         res 51/40:0b:2d:c9:d1/00:00:00:00:00/e1 Emask 0xb (HSM violation)
ata1: soft resetting port
ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata1.00: configured for UDMA/133
ata1: EH complete
sd 0:0:0:0: [sda] 976773168 512-byte hardware sectors (500108 MB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

However, I can't trigger the messages intentionally with disk & network load as I could on the other board.
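
In case it helps when comparing the two exception reports in this thread, here is a rough Python sketch for decoding the cmd/res taskfile dumps. It assumes the field order libata prints (command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device, with status and error in the first two positions of the res line) and 512-byte sectors as reported for sda. The function names are made up for illustration, and only the two opcodes that appear in this report are handled.

```python
import re

FIELDS = ("command", "feature", "nsect", "lbal", "lbam", "lbah",
          "hob_feature", "hob_nsect", "hob_lbal", "hob_lbam", "hob_lbah",
          "device")

# Only the opcodes seen in this report: name and whether it is LBA48.
COMMANDS = {0x25: ("READ DMA EXT", True), 0xC8: ("READ DMA", False)}

STATUS_BITS = ((0x80, "BSY"), (0x40, "DRDY"), (0x20, "DF"),
               (0x10, "DSC"), (0x08, "DRQ"), (0x01, "ERR"))
ERROR_BITS = ((0x80, "ICRC"), (0x40, "UNC"), (0x10, "IDNF"), (0x04, "ABRT"))

SECTOR_SIZE = 512  # assumption, matches "512-byte hardware sectors" in the log


def parse_taskfile(dump):
    """Split a dump like '25/00:58:.../e0' into a dict of named byte values."""
    values = [int(x, 16) for x in re.split(r"[/:]", dump.strip())]
    if len(values) != len(FIELDS):
        raise ValueError("expected 12 taskfile bytes")
    return dict(zip(FIELDS, values))


def decode_cmd(dump):
    """Decode a 'cmd' dump into command name, LBA, sector count and byte count."""
    tf = parse_taskfile(dump)
    name, lba48 = COMMANDS[tf["command"]]  # KeyError for opcodes not listed
    lba = tf["lbah"] << 16 | tf["lbam"] << 8 | tf["lbal"]
    if lba48:
        lba |= (tf["hob_lbah"] << 40 | tf["hob_lbam"] << 32
                | tf["hob_lbal"] << 24)
        sectors = (tf["hob_nsect"] << 8 | tf["nsect"]) or 65536
    else:
        lba |= (tf["device"] & 0x0F) << 24  # LBA28: top 4 bits live in device
        sectors = tf["nsect"] or 256
    return {"name": name, "lba": lba, "sectors": sectors,
            "bytes": sectors * SECTOR_SIZE}


def decode_res(dump):
    """Decode a 'res' dump into lists of set status and error bit names."""
    tf = parse_taskfile(dump)
    status, error = tf["command"], tf["feature"]  # first two bytes of res
    return {"status": [n for bit, n in STATUS_BITS if status & bit],
            "error": [n for bit, n in ERROR_BITS if error & bit]}


if __name__ == "__main__":
    print(decode_cmd("25/00:58:bf:d1:83/00:00:1a:00:00/e0"))
    print(decode_res("50/00:00:16:d2:83/00:00:1a:00:00/e0"))
    print(decode_cmd("c8/00:10:27:c9:d1/00:00:00:00:00/e1"))
    print(decode_res("51/40:0b:2d:c9:d1/00:00:00:00:00/e1"))
```

Decoded this way, the byte counts come out as 45056 and 8192, matching the "data ... in" values in the two logs. The result on the VIA board has only DRDY and DSC set with no error bits, whereas the one on the nForce3 board has ERR set with UNC in the error register, so the two events may not have the same cause.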
Since I haven't seen this message again in the weeks since changing the mainboard, I consider it fixed for me.
A hardware erratum in Promise 2nd-generation controllers, like the 300 TX4 mentioned in this bug report, was fixed in kernel 2.6.24-rc2. So if you see any new errors from sata_promise, please first try a 2.6.24-rc2 or newer kernel, and please report whether the newer driver solved the problem.