Bug 14958 - atl1c timeouts+resets when policy routing is in use
Product: Drivers
Classification: Unclassified
Component: Network
Hardware: All Linux
Importance: P1 normal
Assigned To: drivers_network@kernel-bugs.osdl.org
Reported: 2009-12-30 01:43 UTC by Petr Baudis
Modified: 2013-12-10 17:20 UTC
Kernel Version: 2.6.38
Tree: Mainline
Regression: No


Description Petr Baudis 2009-12-30 01:43:44 UTC
I have this card:

02:00.0 Ethernet controller: Attansic Technology Corp. Device 1063 (rev c0)

with the atl1c driver on the Debian 2.6.32-3 build (which includes the common-task atl1c patch). Normally it works fine, but I need to use a simple policy routing setup:

iptables -t mangle -A OUTPUT -j CONNMARK --restore-mark
iptables -t mangle -A OUTPUT -m mark ! --mark 0 -j RETURN
iptables -t mangle -A OUTPUT -d -j RETURN
iptables -t mangle -A OUTPUT --protocol tcp --dport 25 -j MARK --set-mark 1
iptables -t mangle -A OUTPUT --protocol tcp --sport 25 -j MARK --set-mark 1
iptables -t mangle -A OUTPUT --protocol tcp --dport 110 -j MARK --set-mark 1
iptables -t mangle -A OUTPUT --protocol tcp --sport 110 -j MARK --set-mark 1
iptables -t mangle -A OUTPUT --protocol tcp --dport 993 -j MARK --set-mark 1
iptables -t mangle -A OUTPUT --protocol tcp --sport 993 -j MARK --set-mark 1
iptables -t mangle -A OUTPUT --protocol tcp --dport 21022 -j MARK --set-mark 1
iptables -t mangle -A OUTPUT --protocol tcp --sport 21022 -j MARK --set-mark 1
iptables -t mangle -A OUTPUT -j CONNMARK --save-mark

ip rule add pref 5000 fwmark 1 table msoft
ip route add default via table msoft
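
The eight per-port MARK rules above all follow one pattern (one --dport and one --sport rule per port). A minimal sketch that generates them is shown below; gen_mark_rules is a hypothetical helper name, not part of iptables, and it only prints the commands rather than running them. The CONNMARK rules and the -d ... RETURN rule are not generated and stay as written above.

```shell
#!/bin/sh
# gen_mark_rules MARK PORT...
# Hypothetical helper: prints (does not execute) one --dport and one --sport
# MARK rule per port for the mangle OUTPUT chain, in the same form as the
# rules in the report.
gen_mark_rules() {
    mark=$1
    shift
    for port in "$@"; do
        for dir in dport sport; do
            echo "iptables -t mangle -A OUTPUT --protocol tcp --$dir $port -j MARK --set-mark $mark"
        done
    done
}

# Same mark and ports as in the report.
gen_mark_rules 1 25 110 993 21022
```

Since the output is plain iptables commands, it can be reviewed first and then piped to sh as root to apply it.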

With this setup, the tx queue keeps timing out, which causes the network device to reset and, with it, all TCP connections. In practice this happens once every 2-3 minutes. I cannot say for certain, but it seems to happen only with marked packets: without the iptables rules I have not hit this problem yet, and with heavy policy-routed traffic the device resets much more often.

This is what I get in the log the first time it happens:

[  345.000010] ------------[ cut here ]------------
[  345.000023] WARNING: at /tmp/buildd/linux-2.6-2.6.32/debian/build/source_i386_none/net/sched/sch_generic.c:261 dev_watchdog+0xbd/0x15d()
[  345.000028] Hardware name: G31M-ES2L
[  345.000031] NETDEV WATCHDOG: eth2 (atl1c): transmit queue 0 timed out
[  345.000034] Modules linked in: nfsd lockd nfs_acl auth_rpcgss sunrpc exportfs i915 drm_kms_helper drm i2c_algo_bit video output fuse xt_tcpudp xt_MARK xt_mark nf_conntrack_ipv4 nf_defrag_ipv4 iptable_mangle xt_CONNMARK nf_conntrack ip_tables x_tables atl1e loop snd_hda_codec_realtek psmouse rng_core snd_hda_intel snd_hda_codec snd_hwdep pcspkr snd_pcm_oss snd_mixer_oss snd_pcm snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq snd_timer snd_seq_device processor i2c_i801 snd i2c_core soundcore snd_page_alloc atl1c parport_pc evdev parport serio_raw ext3 jbd mbcache sg sd_mod sr_mod cdrom crc_t10dif ata_generic ide_pci_generic ide_core uhci_hcd ata_piix libata ne2k_pci 8390 intel_agp ehci_hcd scsi_mod usbcore floppy nls_base button agpgart thermal fan thermal_sys [last unloaded: scsi_wait_scan]
[  345.000136] Pid: 0, comm: swapper Not tainted 2.6.32-trunk-686 #1
[  345.000140] Call Trace:
[  345.000148]  [<c1030a5d>] ? warn_slowpath_common+0x5e/0x8a
[  345.000154]  [<c11d5f80>] ? dev_watchdog+0x0/0x15d
[  345.000160]  [<c1030abb>] ? warn_slowpath_fmt+0x26/0x2a
[  345.000165]  [<c11d603d>] ? dev_watchdog+0xbd/0x15d
[  345.000177]  [<c10464c4>] ? hrtimer_forward+0x10c/0x124
[  345.000181]  [<c102e59d>] ? scheduler_tick+0xd3/0x1ec
[  345.000185]  [<c103b1ec>] ? run_timer_softirq+0x16a/0x1eb
[  345.000190]  [<c1035b0c>] ? __do_softirq+0xaa/0x151
[  345.000193]  [<c1035be4>] ? do_softirq+0x31/0x3c
[  345.000196]  [<c1035cba>] ? irq_exit+0x26/0x58
[  345.000201]  [<c1014b50>] ? smp_apic_timer_interrupt+0x6c/0x76
[  345.000205]  [<c1003b35>] ? apic_timer_interrupt+0x31/0x38
[  345.000210]  [<c1008ff7>] ? mwait_idle+0x62/0x6c
[  345.000213]  [<c1002388>] ? cpu_idle+0x89/0xa5
[  345.000217]  [<c13997fb>] ? start_kernel+0x307/0x30c
[  345.000220] ---[ end trace 0049f9198d0df675 ]---

Then every time the device gets reset again, I get just:

[ 1190.024970] atl1c 0000:02:00.0: irq 26 for MSI/MSI-X
[ 1190.025033] atl1c 0000:02:00.0: atl1c: eth2 NIC Link is Up<100 Mbps Full Duplex>
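
To put a number on how often the resets happen, the two message types quoted above can be counted in the kernel log. This is a rough sketch: count_atl1c_events is a hypothetical helper name, and the patterns are taken only from the log lines in this report, so other kernel versions may word the messages differently. Because the watchdog warning is reported only the first time, the link-up count is the more useful figure; it also includes the initial link-up at boot.

```shell
#!/bin/sh
# count_atl1c_events: hypothetical helper, reads kernel log text on stdin.
# Counts tx watchdog timeouts and atl1c link-up messages, using patterns
# copied from the log lines quoted in this report.
count_atl1c_events() {
    awk '
        /NETDEV WATCHDOG: .*\(atl1c\): transmit queue [0-9]+ timed out/ { timeouts++ }
        /atl1c: .* NIC Link is Up/ { linkups++ }
        END { printf "timeouts=%d linkups=%d\n", timeouts, linkups }
    '
}

# Typical use (assumes dmesg is readable by the current user):
#   dmesg | count_atl1c_events
```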

This is highly annoying, since the machine becomes pretty much unusable for networking with the policy routing enabled.

I will be happy to provide more debugging output or try out patches (though my options for experimenting are limited, since the machine is 140 km away from me).
Comment 1 Darksurf 2011-02-01 20:46:51 UTC
This may be related to the issue I'm having. I have an AR8152 Fast Ethernet card (V1.1, rev C) with issues on one specific network. That network is poorly designed and runs through 2 DNS servers; it was a small network that was later added to a large network at a college. When I connect to it, it acts as if my packets disappear as they leave the NIC: I get a good connection for 5 seconds, then I start getting 100% packet loss.

If I run ifconfig eth0 down, then ifconfig eth0 up (resetting the device), it connects and I get 5 seconds of good connection; after that it's back to 100% packet loss. I'm not the only one experiencing this. My friend has an Intel laptop and I have an AMD laptop; both are Toshibas with the same NIC, and both have exactly the same problem. The network works fine with any other device; it's just this NIC + driver combination. It works fine in Windows.
Comment 2 Darksurf 2011-02-01 20:53:55 UTC
This may or may not be related: when I plug a cable into the port, NetworkManager doesn't notice, as if the card has problems. If I reset the device, it connects. Sometimes it notices the connection, sometimes not. This also affects my friend's Intel laptop. It's almost as if the NIC doesn't wake up when a cable is connected (just a description, not necessarily the actual cause). I'd give you logs if I knew what to collect, but dmesg shows no errors on the topic. It's not NetworkManager either, because it works fine on all other devices and NICs.
Comment 3 Darksurf 2011-02-01 20:57:19 UTC
Oh, this issue also exists on all kernels from .32 to .37; I don't know about .38, as it's not a full release yet.
Comment 4 Jonathan Vargas 2011-06-21 12:53:54 UTC
This exact problem reproduces in my environment with 2.6.38 and 2.6.39 too.

As described by the submitter, I just had to flush all chains in the mangle table with iptables to make it work again.

I hope this can be fixed soon, because routers and gateways cannot use rules like these.
Comment 5 Tache Madalin 2012-01-27 02:40:59 UTC
I can also confirm this on Ubuntu 10.04 with 2.6.38-13-server.

A fix would be highly appreciated.
