Bug 9102

Summary: (sata_promise) HSM violation exceptions in combination with network load
Product: IO/Storage
Reporter: Sebastian Witt (se.witt)
Component: Serial ATA
Assignee: Mikael Pettersson (mikpelinux)
Status: REJECTED DOCUMENTED
Severity: normal
CC: htejun, jgarzik
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.23-rc8
Subsystem:
Regression: ---
Bisected commit-id:
Attachments: Kernel log

Description Sebastian Witt 2007-09-30 07:39:15 UTC
Most recent kernel where this bug did not occur:
Distribution: Gentoo
Hardware Environment: Opteron 175, VIA K8T800Pro Host Bridge, VIA VT8237 PCI bridge [K8T800/K8T890 South], Promise PDC40718 (SATA 300 TX4) (rev 02), VIA VT6102 [Rhine-II], Intel Corporation 82541PI Gigabit Ethernet Controller, SAMSUNG HD501LJ
Software Environment:
Problem Description:

When the SATA controller is under high load in combination with network I/O, the kernel log shows the following exceptions:
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2
ata1.00: port_status 0x20080000
ata1.00: cmd 25/00:58:bf:d1:83/00:00:1a:00:00/e0 tag 0 cdb 0x0 data 45056 in
         res 50/00:00:16:d2:83/00:00:1a:00:00/e0 Emask 0x2 (HSM violation)
ata1: soft resetting port
ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata1.00: configured for UDMA/133
ata1: EH complete

This also occurs on all the other ports (ata1 through ata4, four HD501LJ disks). The message rate is about 4 per minute when transferring at maximum speed (disk and network).
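For readers unfamiliar with the libata exception format, the `cmd` line in the log above can be decoded by hand. The following is a minimal sketch, not part of the original report. It assumes the field order libata prints for a taskfile (command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device) and 512-byte sectors; the names `decode_cmd`, `TF_RE` and `OPCODES` are made up for illustration, and only the two opcodes that appear in this report are handled.

```python
import re

# Hypothetical helper, not from the report: decode the "cmd" taskfile.
# Assumed field order: command/feature:nsect:lbal:lbam:lbah/
#                      hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device
TF_RE = re.compile(r"cmd ((?:[0-9a-f]{2}[/:]){11}[0-9a-f]{2})")

# Only the opcodes seen in this bug report.
OPCODES = {0x25: "READ DMA EXT", 0xC8: "READ DMA"}


def decode_cmd(line):
    """Return command name, LBA and transfer size for a libata 'cmd' line."""
    m = TF_RE.search(line)
    if m is None:
        return None
    tf = [int(x, 16) for x in re.split(r"[/:]", m.group(1))]
    (command, _feature, nsect, lbal, lbam, lbah,
     _hob_feature, hob_nsect, hob_lbal, hob_lbam, hob_lbah, device) = tf
    if command not in OPCODES:
        return {"command": hex(command)}
    low = lbah << 16 | lbam << 8 | lbal
    if command == 0x25:
        # 48-bit LBA: the high-order bytes hold LBA bits 24..47.
        lba = hob_lbah << 40 | hob_lbam << 32 | hob_lbal << 24 | low
        sectors = (hob_nsect << 8 | nsect) or 65536  # a count of 0 means 65536
    else:
        # 28-bit LBA: the low nibble of the device byte holds bits 24..27.
        lba = (device & 0x0F) << 24 | low
        sectors = nsect or 256  # a count of 0 means 256
    return {
        "command": OPCODES[command],
        "lba": lba,
        "sectors": sectors,
        "bytes": sectors * 512,  # assumes 512-byte sectors
    }


if __name__ == "__main__":
    print(decode_cmd("ata1.00: cmd 25/00:58:bf:d1:83/00:00:1a:00:00/e0 "
                     "tag 0 cdb 0x0 data 45056 in"))
```

Under these assumptions the failing command is a READ DMA EXT of 0x58 = 88 sectors, which matches the 45056 bytes in the log's `data` field.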

This only happens when there is (high) network traffic. Reading more than 300 GB locally does not trigger HSM violation messages. Starting network traffic (ping -f ...) immediately triggers the messages.

Interestingly, switching the network interface from e1000 to the onboard VIA Rhine does not change this behaviour.

/proc/interrupts:

          CPU0       CPU1       
  0:         86          1   IO-APIC-edge      timer
  1:          0          8   IO-APIC-edge      i8042
  8:          0          2   IO-APIC-edge      rtc
  9:          0          0   IO-APIC-fasteoi   acpi
 14:          2       2676   IO-APIC-edge      ide0
 16:      57697         45   IO-APIC-fasteoi   eth0
 17:          0          0   IO-APIC-fasteoi   eth1
 18:          0        363   IO-APIC-fasteoi   ide2, ide3
 20:          0      93476   IO-APIC-fasteoi   sata_promise
 21:          0          0   IO-APIC-fasteoi   ehci_hcd:usb1, uhci_hcd:usb2, uhci_hcd:usb3, uhci_hcd:usb4, uhci_hcd:usb5
NMI:          0          0 
LOC:     537738     537580 
ERR:          0
MIS:          0
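One way to check whether a shared interrupt line could link the disk and network activity is to list which devices sit on each IRQ. The sketch below is an illustration only. It assumes the 2.6.23-era /proc/interrupts layout shown above (one count column per CPU, then the interrupt type, then a comma-separated device list); the function names and the `SAMPLE` excerpt are hypothetical, and newer kernels print additional columns that this parser does not handle.

```python
# Hypothetical helpers, not from the report. SAMPLE is an excerpt of the
# /proc/interrupts listing quoted in this bug.
SAMPLE = """\
          CPU0       CPU1
 16:      57697         45   IO-APIC-fasteoi   eth0
 18:          0        363   IO-APIC-fasteoi   ide2, ide3
 20:          0      93476   IO-APIC-fasteoi   sata_promise
LOC:     537738     537580
"""


def parse_interrupts(text):
    """Map IRQ number -> list of device names."""
    lines = [line for line in text.splitlines() if line.strip()]
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    irqs = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        label = label.strip()
        if not label.isdigit():  # skip NMI, LOC, ERR, MIS
            continue
        # Per-CPU counts, then the interrupt type, then the device list.
        fields = rest.split(None, ncpus + 1)
        devices = fields[ncpus + 1] if len(fields) > ncpus + 1 else ""
        irqs[int(label)] = [d.strip() for d in devices.split(",") if d.strip()]
    return irqs


def shared_irqs(irqs):
    """Return only the IRQs that carry more than one device."""
    return {irq: devs for irq, devs in irqs.items() if len(devs) > 1}


if __name__ == "__main__":
    table = parse_interrupts(SAMPLE)
    print("shared:", shared_irqs(table))
    print("sata_promise IRQ:",
          [irq for irq, devs in table.items() if "sata_promise" in devs])
```

In the listing above, sata_promise (IRQ 20) and eth0 (IRQ 16) are on separate lines, so the two devices do not share an interrupt in this configuration.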

Tickless and cpufreq are disabled. Disabling SMP does not change the behaviour.
The next thing I will try is moving the SATA controller to a different PCI slot.

Steps to reproduce:

1. Generate heavy disk I/O
2. Wait a while to confirm that no messages occur
3. Generate network traffic on some interface
4. HSM violation messages occur
Comment 1 Sebastian Witt 2007-09-30 07:42:11 UTC
Created attachment 13000
Kernel log
Comment 2 Tejun Heo 2007-10-02 02:29:47 UTC
cc'd Mikael Pettersson for sata_promise.
Comment 3 Sebastian Witt 2007-10-02 03:28:23 UTC
Tested different PCI slots, no change.
Disabling PCI posted write/delayed transaction in the BIOS setup did not help either; it only decreased performance.
Comment 4 Mikael Pettersson 2007-10-02 04:15:38 UTC
If you can, please try putting the Promise card + disks and the NICs in another machine with a different (preferably newer/better) chipset.

I've seen Promise SATA cards trigger the error you mentioned all by themselves on some machines, while the same card/cable/disk combination works better in other machines.

At this point, I strongly suspect chipset/PCI interaction issues, though I don't know what they might be or if they can be worked around in the driver.
Comment 5 Sebastian Witt 2007-10-20 10:53:36 UTC
I put the cards in an nForce3-based board for testing; so far no messages. By the way, on the VIA-based board all PCI devices were on bus 0 (chipset architecture?), whereas on the nForce3 board the PCI slots (external PCI bus) are on bus 2.
Comment 6 Sebastian Witt 2007-10-21 06:59:46 UTC
After approx. 20 hours one message showed up:

ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2
ata1.00: port_status 0x20280000
ata1.00: cmd c8/00:10:27:c9:d1/00:00:00:00:00/e1 tag 0 cdb 0x0 data 8192 in
         res 51/40:0b:2d:c9:d1/00:00:00:00:00/e1 Emask 0xb (HSM violation)
ata1: soft resetting port
ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata1.00: configured for UDMA/133
ata1: EH complete
sd 0:0:0:0: [sda] 976773168 512-byte hardware sectors (500108 MB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

However, I can't trigger them intentionally with disk and network load as I could on the other board.
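The `res` line of this event differs from the one in the original description (51/40 with Emask 0xb here, 50/00 with Emask 0x2 there). The sketch below decodes those bytes so the two events can be compared. It is an illustration, not part of the report: the bit names follow the conventional ATA status and error register layout and libata's AC_ERR_* flags as best recalled, so treat the tables as assumptions, and the names `decode_res` and `flags` are hypothetical.

```python
import re

# Assumed bit tables (conventional ATA register layout, libata AC_ERR_* order).
STATUS_BITS = [(0x80, "BSY"), (0x40, "DRDY"), (0x20, "DF"), (0x10, "DSC"),
               (0x08, "DRQ"), (0x04, "CORR"), (0x02, "IDX"), (0x01, "ERR")]
ERROR_BITS = [(0x80, "ICRC"), (0x40, "UNC"), (0x20, "MC"), (0x10, "IDNF"),
              (0x08, "MCR"), (0x04, "ABRT"), (0x02, "TRK0NF"), (0x01, "AMNF")]
EMASK_BITS = [(0x01, "DEV"), (0x02, "HSM"), (0x04, "TIMEOUT"), (0x08, "MEDIA"),
              (0x10, "ATA_BUS"), (0x20, "HOST_BUS"), (0x40, "SYSTEM"),
              (0x80, "INVALID")]

# Matches "res <status>/<error>:... Emask 0x<mask>" in a libata exception line.
RES_RE = re.compile(r"res ([0-9a-f]{2})/([0-9a-f]{2}):.*Emask (0x[0-9a-f]+)")


def flags(value, table):
    """List the names of all bits from `table` that are set in `value`."""
    return [name for bit, name in table if value & bit]


def decode_res(line):
    """Decode status, error and Emask from a libata 'res' line."""
    m = RES_RE.search(line)
    if m is None:
        return None
    status, error, emask = (int(g, 16) for g in m.groups())
    return {
        "status": flags(status, STATUS_BITS),
        "error": flags(error, ERROR_BITS),
        "emask": flags(emask, EMASK_BITS),
    }


if __name__ == "__main__":
    for line in (
        "res 50/00:00:16:d2:83/00:00:1a:00:00/e0 Emask 0x2 (HSM violation)",
        "res 51/40:0b:2d:c9:d1/00:00:00:00:00/e1 Emask 0xb (HSM violation)",
    ):
        print(decode_res(line))
```

Under these assumed tables, the event on the VIA board has no error bits set beyond the HSM flag, while the event on the nForce3 board also has ERR, UNC and the device and media flags set, which may point to a different underlying cause.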
Comment 7 Sebastian Witt 2007-11-10 11:04:36 UTC
Since I haven't seen this message in the weeks since changing the mainboard, it's fixed for me.
Comment 8 Mikael Pettersson 2007-11-12 02:40:10 UTC
A hardware erratum in Promise 2nd-generation controllers, like the 300 TX4 mentioned in this bug report, was fixed in kernel 2.6.24-rc2.

So if you see any new errors from sata_promise, please first try a 2.6.24-rc2 or newer kernel, and please report whether the newer driver solved the problem.