Bug 9102 - (sata_promise) HSM violation exceptions in combination with network load
Status: REJECTED DOCUMENTED
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: Serial ATA
Hardware: All Linux
Importance: P1 normal
Assignee: Mikael Pettersson
Reported: 2007-09-30 07:39 UTC by Sebastian Witt
Modified: 2007-11-12 02:40 UTC
Kernel Version: 2.6.23-rc8


Attachments
Kernel log (28.92 KB, text/plain)
2007-09-30 07:42 UTC, Sebastian Witt

Description Sebastian Witt 2007-09-30 07:39:15 UTC
Most recent kernel where this bug did not occur:
Distribution: Gentoo
Hardware Environment: Opteron 175, VIA K8T800Pro Host Bridge, VIA VT8237 PCI bridge [K8T800/K8T890 South], Promise PDC40718 (SATA 300 TX4) (rev 02), VIA VT6102 [Rhine-II], Intel Corporation 82541PI Gigabit Ethernet Controller, SAMSUNG HD501LJ
Software Environment:
Problem Description:

When the SATA controller is under high load in combination with network I/O, the kernel log shows the following exceptions:
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2
ata1.00: port_status 0x20080000
ata1.00: cmd 25/00:58:bf:d1:83/00:00:1a:00:00/e0 tag 0 cdb 0x0 data 45056 in
         res 50/00:00:16:d2:83/00:00:1a:00:00/e0 Emask 0x2 (HSM violation)
ata1: soft resetting port
ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata1.00: configured for UDMA/133
ata1: EH complete
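
For reference, the cmd and res lines can be decoded by hand. The following is a minimal Python sketch, assuming the twelve bytes are printed in the order used by libata's error handler: command (or status), feature (or error), sector count, LBA low/mid/high, the five matching high-order bytes, and finally the device register. The helper names (parse_taskfile, lba48, sectors48, decode_status) are hypothetical and not part of any kernel tool.

```python
import re

# Assumed field order of the "cmd"/"res" lines printed by libata's error handler.
# In a "res" line the first two bytes hold the status and error registers.
TF_FIELDS = (
    "command", "feature", "nsect", "lbal", "lbam", "lbah",
    "hob_feature", "hob_nsect", "hob_lbal", "hob_lbam", "hob_lbah",
    "device",
)

# A few well-known ATA status register bits; other bits are ignored here.
STATUS_BITS = ((0x80, "BSY"), (0x40, "DRDY"), (0x08, "DRQ"), (0x01, "ERR"))


def parse_taskfile(text):
    """Split 'cc/ff:nn:ll:mm:hh/ff:nn:ll:mm:hh/dd' into named register bytes."""
    values = [int(part, 16) for part in re.split(r"[/:]", text.strip())]
    if len(values) != len(TF_FIELDS):
        raise ValueError("expected %d bytes, got %d" % (len(TF_FIELDS), len(values)))
    return dict(zip(TF_FIELDS, values))


def lba48(tf):
    """Combine the six LBA bytes of a 48-bit command into one address."""
    return (tf["hob_lbah"] << 40 | tf["hob_lbam"] << 32 | tf["hob_lbal"] << 24
            | tf["lbah"] << 16 | tf["lbam"] << 8 | tf["lbal"])


def sectors48(tf):
    """Sector count of a 48-bit command; a count of 0 means 65536 sectors."""
    return (tf["hob_nsect"] << 8 | tf["nsect"]) or 65536


def decode_status(status):
    """Return the names of the known bits set in a status byte."""
    return [name for mask, name in STATUS_BITS if status & mask]


# The first cmd/res pair from the log above.
cmd = parse_taskfile("25/00:58:bf:d1:83/00:00:1a:00:00/e0")
res = parse_taskfile("50/00:00:16:d2:83/00:00:1a:00:00/e0")
print(hex(cmd["command"]), hex(lba48(cmd)), sectors48(cmd) * 512)
print(decode_status(res["command"]))
```

For the cmd line above, command 0x25 is READ DMA EXT and the sector count is 0x58 = 88, i.e. 88 * 512 = 45056 bytes, which matches the "data 45056 in" field.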

This also occurs on all the other ports (ata1 to ata4, four HD501LJ disks). The rate is about 4 messages per minute when transferring at maximum speed (disk & network).

This only happens when there is (high) network traffic. Reading more than 300 GB locally does not trigger any HSM violation messages. Starting network traffic (ping -f ...) immediately triggers them.

The interesting thing is: Changing the network interface from e1000 to the onboard VIA Rhine does not change this behaviour.

/proc/interrupts:

          CPU0       CPU1       
  0:         86          1   IO-APIC-edge      timer
  1:          0          8   IO-APIC-edge      i8042
  8:          0          2   IO-APIC-edge      rtc
  9:          0          0   IO-APIC-fasteoi   acpi
 14:          2       2676   IO-APIC-edge      ide0
 16:      57697         45   IO-APIC-fasteoi   eth0
 17:          0          0   IO-APIC-fasteoi   eth1
 18:          0        363   IO-APIC-fasteoi   ide2, ide3
 20:          0      93476   IO-APIC-fasteoi   sata_promise
 21:          0          0   IO-APIC-fasteoi   ehci_hcd:usb1, uhci_hcd:usb2, uhci_hcd:usb3, uhci_hcd:usb4, uhci_hcd:usb5
NMI:          0          0 
LOC:     537738     537580 
ERR:          0
MIS:          0
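
In the table above, sata_promise sits alone on IRQ 20, while the two NICs are on IRQ 16 and 17, so the controller does not share an interrupt line with either network interface. A minimal Python sketch for checking this from a saved copy of /proc/interrupts follows. It assumes the layout shown above (one count column per CPU, then the controller type, then a comma-separated device list), and the function names are hypothetical.

```python
def parse_interrupts(text):
    """Return {irq_number: [device names]} for the numbered IRQ rows."""
    lines = text.strip("\n").splitlines()
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    irqs = {}
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        label = label.strip()
        if not sep or not label.isdigit():
            continue  # skip summary rows such as NMI, LOC, ERR, MIS
        fields = rest.split()
        # fields: one count per CPU, the controller type, then the devices
        devices = " ".join(fields[ncpus + 1:])
        irqs[int(label)] = [d.strip() for d in devices.split(",") if d.strip()]
    return irqs


def shared_with(irqs, driver):
    """List the other devices on the same IRQ line as `driver` (None if absent)."""
    for devices in irqs.values():
        if driver in devices:
            return [d for d in devices if d != driver]
    return None
```

Calling shared_with(parse_interrupts(text), "sata_promise") on the table above returns an empty list, since no other device is registered on IRQ 20.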

Tickless and cpufreq are disabled. Disabling SMP does not change the behaviour.
The next thing I will try is moving the SATA controller to a different PCI slot.

Steps to reproduce:

1. Generate heavy disk I/O
2. Wait some time to check no messages occur
3. Generate network traffic on some interface
4. HSM violation messages occur
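
Steps 2 and 4 amount to counting HSM violation messages in the kernel log before and after the network traffic starts. A minimal Python sketch of such a counter follows. It assumes the log format quoted in this report, where a "cmd" line carrying the device name is followed by a "res" line ending in "(HSM violation)". The function name count_hsm_violations is hypothetical.

```python
import re
from collections import Counter

# Matches the device prefix of a libata "cmd" line, e.g. "ata1.00: cmd 25/00:..."
DEVICE_RE = re.compile(r"(ata\d+\.\d+): cmd ")


def count_hsm_violations(log_text):
    """Count '(HSM violation)' results per ATA device in kernel log text."""
    counts = Counter()
    device = None
    for line in log_text.splitlines():
        match = DEVICE_RE.search(line)
        if match:
            # Remember which device the following "res" line belongs to.
            device = match.group(1)
        elif "res " in line and "(HSM violation)" in line:
            counts[device or "unknown"] += 1
            device = None
    return counts
```

Running it on the log text captured before and after step 3 and comparing the two results gives the number of new messages per device.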
Comment 1 Sebastian Witt 2007-09-30 07:42:11 UTC
Created attachment 13000
Kernel log
Comment 2 Tejun Heo 2007-10-02 02:29:47 UTC
cc'd Mikael Pettersson for sata_promise.
Comment 3 Sebastian Witt 2007-10-02 03:28:23 UTC
Tested different PCI slots, no change.
Disabling PCI posted write/delayed transaction in the BIOS setup did not help either; it only decreased performance.
Comment 4 Mikael Pettersson 2007-10-02 04:15:38 UTC
If you can, please try putting the Promise card + disks and the NICs in another machine with a different (preferably newer/better) chipset.

I've seen Promise SATA cards trigger the error you mentioned all by themselves on some machines, while the same card/cable/disk combination works better in other machines.

At this point, I strongly suspect chipset/PCI interaction issues, though I don't know what they might be or if they can be worked around in the driver.
Comment 5 Sebastian Witt 2007-10-20 10:53:36 UTC
I put the cards in an nForce3-based board for testing; so far, no messages. By the way, on the VIA-based board all PCI devices were on bus 0 (chipset architecture?), whereas on the nForce3 board the PCI slots (external PCI bus) are on bus 2.
Comment 6 Sebastian Witt 2007-10-21 06:59:46 UTC
After approx. 20 hours one message showed up:

ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2
ata1.00: port_status 0x20280000
ata1.00: cmd c8/00:10:27:c9:d1/00:00:00:00:00/e1 tag 0 cdb 0x0 data 8192 in
         res 51/40:0b:2d:c9:d1/00:00:00:00:00/e1 Emask 0xb (HSM violation)
ata1: soft resetting port
ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata1.00: configured for UDMA/133
ata1: EH complete
sd 0:0:0:0: [sda] 976773168 512-byte hardware sectors (500108 MB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

However, I can't trigger them intentionally with disk & network load, as I could on the other board.
Comment 7 Sebastian Witt 2007-11-10 11:04:36 UTC
Since I haven't seen this message in the weeks since changing the mainboard, it's fixed for me.
Comment 8 Mikael Pettersson 2007-11-12 02:40:10 UTC
A hardware erratum in Promise 2nd-generation controllers, like the 300 TX4 mentioned in this bug report, was fixed in kernel 2.6.24-rc2.

So if you see any new errors from sata_promise, please first try a 2.6.24-rc2 or newer kernel, and please report whether the newer driver solved the problem.
