Bug 9808 - e1000: system hung with htb QoS
Status: CLOSED OBSOLETE
Alias: None
Product: Drivers
Classification: Unclassified
Component: Network
Hardware: All Linux
Importance: P1 normal
Assignee: networking_netfilter-iptables@kernel-bugs.osdl.org
Reported: 2008-01-24 03:03 UTC by Kapetanakis Giannis
Modified: 2012-05-21 14:51 UTC
Kernel Version: 2.6.23.9
Regression: No

Attachments
e1000-dump (108.29 KB, application/x-gzip)
2008-01-25 04:41 UTC, Kapetanakis Giannis
e1000_2_dump (33.01 KB, application/x-gzip)
2008-01-25 04:56 UTC, Kapetanakis Giannis

Description Kapetanakis Giannis 2008-01-24 03:03:10 UTC
Hi,

I've set up QoS on my FTP server to limit outgoing traffic. The server stops
responding (no output, no keyboard) at unpredictable intervals: sometimes it
takes an hour, sometimes up to 4 days for the system to hang.

I have attached my QoS startup script, dmesg output, lspci -vvv output, and
the iptables rules that interact with QoS.

I'm also receiving this quite often:
Jan 15 12:23:17 ftp kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
Jan 15 12:23:17 ftp kernel:   Tx Queue             <0>
Jan 15 12:23:17 ftp kernel:   TDH                  <2a>
Jan 15 12:23:17 ftp kernel:   TDT                  <17>
Jan 15 12:23:17 ftp kernel:   next_to_use          <17>
Jan 15 12:23:17 ftp kernel:   next_to_clean        <2a>
Jan 15 12:23:17 ftp kernel: buffer_info[next_to_clean]
Jan 15 12:23:17 ftp kernel:   time_stamp           <5798144>
Jan 15 12:23:17 ftp kernel:   next_to_watch        <2d>
Jan 15 12:23:17 ftp kernel:   jiffies              <57988ef>
Jan 15 12:23:17 ftp kernel:   next_to_watch.status <0>
Jan 15 12:23:19 ftp kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang

Today for the first time (after applying options to e1000 driver in
modprobe.conf) I got a kernel panic:

BUG: unable to handle kernel paging request at virtual address a0379120
EIP: 0060: [<c05db2dc>] Not Tainted VLI
EIP is at ip_rcv+0x286/0x4ba
Kernel panic - not syncing: Fatal exception in interrupt

This is what I wrote down on paper because it wasn't logged anywhere.
Usually it hangs without a kernel panic.

System is Fedora Core 8, up to date
2.6.23.9-85.fc8PAE
2x Intel(R) Xeon(TM) CPU 3.20GHz
4G RAM

Without QoS loaded, the system never hangs, so the hang must be related to it.
However, the e1000 error I'm receiving must have to do with the e1000 driver.
I've seen this bug in the past, which is why I tried applying the options in
modprobe.conf.

Any help will be appreciated.
Thanks in advance.

Giannis

QoS startup script:
# default WAN limit
LIMIT="80mbit"
LOW_LIMIT="50mbit"

start() {
        echo -n "Starting QoS: (WAN limit set to ${LIMIT})"
        tc qdisc del dev eth0 root    2> /dev/null > /dev/null
        tc qdisc del dev eth0 ingress 2> /dev/null > /dev/null
        ADD_CLASS="tc class add dev eth0 "
        ###### uplink
        # install root HTB, point default traffic to 1:25
        tc qdisc add dev eth0 root handle 1: htb  default 25

        tc class add dev eth0 parent 1: classid 1:1 htb rate 1000mbit
        # class for outgoing SYN packets + Minimize-Delay TOS
        ${ADD_CLASS} parent 1:1 classid 1:11 htb rate 2mbit ceil 5mbit prio 1
        # class for internal LAN traffic
        ${ADD_CLASS} parent 1:1 classid 1:12 htb rate 500mbit ceil 800mbit prio 2
        # class for WAN traffic
        ${ADD_CLASS} parent 1:1 classid 1:2 htb rate ${LIMIT} ceil ${LIMIT} prio 3
        # class for WAN http traffic
        ${ADD_CLASS} parent 1:2 classid 1:24 htb rate 30mbit ceil ${LIMIT} prio 4
        # default class, rest WAN traffic
        ${ADD_CLASS} parent 1:2 classid 1:25 htb rate 20mbit ceil ${LIMIT} prio 5

        tc filter add dev eth0 protocol ip parent 1:0 prio 1 handle 1 fw flowid 1:11
        tc filter add dev eth0 protocol ip parent 1:0 prio 2 handle 2 fw flowid 1:12
        tc filter add dev eth0 protocol ip parent 1:0 prio 4 u32 \
                match ip sport 80 0xffff flowid 1:24

        tc qdisc add dev eth0 parent 1:11 handle 11: sfq perturb 10
        tc qdisc add dev eth0 parent 1:12 handle 12: sfq perturb 10
        tc qdisc add dev eth0 parent 1:24 handle 24: sfq perturb 10
        tc qdisc add dev eth0 parent 1:25 handle 25: sfq perturb 10

        echo
}

stop() {
        echo -n "Stopping QoS: "
        tc qdisc del dev eth0 root    2> /dev/null > /dev/null
        tc qdisc del dev eth0 ingress 2> /dev/null > /dev/null
        echo
}
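
As a sanity check on the class hierarchy in this script, common HTB guidance is that the guaranteed rates of sibling classes should not add up to more than their parent's rate. The sketch below checks that for the two levels defined above. The helper name check_rates is hypothetical, and the numbers are copied by hand from the script with LIMIT=80mbit; it does not query tc.

```shell
#!/bin/sh
# check_rates PARENT_MBIT CHILD_MBIT...
# Prints the totals and succeeds only if the children's guaranteed rates
# sum to no more than the parent's rate (all values in mbit, decimal).
check_rates() {
    parent=$1
    shift
    total=0
    for rate in "$@"; do
        total=$(( total + rate ))
    done
    echo "children=${total}mbit parent=${parent}mbit"
    [ "$total" -le "$parent" ]
}

# Classes 1:11, 1:12 and 1:2 under 1:1 (rate 1000mbit).
check_rates 1000 2 500 80 || echo "1:1 is oversubscribed"
# Classes 1:24 and 1:25 under 1:2 (rate LIMIT=80mbit).
check_rates 80 30 20 || echo "1:2 is oversubscribed"
```

Both levels pass this check, so the rate hierarchy itself does not look oversubscribed.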

-------------------
QoS startup script: http://www.edu.physics.uoc.gr/~bilias/ftp/QoS
dmesg: http://www.edu.physics.uoc.gr/~bilias/ftp/dmesg
lspci -vvv: http://www.edu.physics.uoc.gr/~bilias/ftp/lspci
iptables for QoS: http://www.edu.physics.uoc.gr/~bilias/ftp/iptables

modprobe.conf options for e1000: 
options e1000 XsumRX=0 Speed=1000 Duplex=2 InterruptThrottleRate=0 FlowControl=3 RxDescriptors=4096 TxDescriptors=4096 RxIntDelay=0 TxIntDelay=0
Comment 1 Anonymous Emailer 2008-01-24 06:12:33 UTC
Reply-To: akpm@linux-foundation.org

> On Thu, 24 Jan 2008 03:03:11 -0800 (PST) bugme-daemon@bugzilla.kernel.org
> wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=9808
> 
>            Summary: system hung with htb QoS
>            Product: Networking
>            Version: 2.5
>      KernelVersion: 2.6.23.9
>           Platform: All
>         OS/Version: Linux
>               Tree: Fedora
>             Status: NEW
>           Severity: normal
>           Priority: P1
>          Component: Netfilter/Iptables
>         AssignedTo: networking_netfilter-iptables@kernel-bugs.osdl.org
>         ReportedBy: bilias@edu.physics.uoc.gr
Comment 2 Jesse Brandeburg 2008-01-24 12:07:07 UTC
Andrew Morton wrote:
>> I'm also receiving this quite often:
>> Jan 15 12:23:17 ftp kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
>> Jan 15 12:23:17 ftp kernel:   Tx Queue             <0>
>> Jan 15 12:23:17 ftp kernel:   TDH                  <2a>
>> Jan 15 12:23:17 ftp kernel:   TDT                  <17>
>> Jan 15 12:23:17 ftp kernel:   next_to_use          <17>
>> Jan 15 12:23:17 ftp kernel:   next_to_clean        <2a>
>> Jan 15 12:23:17 ftp kernel: buffer_info[next_to_clean]
>> Jan 15 12:23:17 ftp kernel:   time_stamp           <5798144>
>> Jan 15 12:23:17 ftp kernel:   next_to_watch        <2d>
>> Jan 15 12:23:17 ftp kernel:   jiffies              <57988ef>
>> Jan 15 12:23:17 ftp kernel:   next_to_watch.status <0>
>> Jan 15 12:23:19 ftp kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang

Looks like a real hardware hang.

Would you be willing to try the 7.6.15 driver at e1000.sourceforge.net?
It has many more fixes for e1000 than the in-kernel driver does.  I just
posted a patch in the "Tracker/Patches" area that patches 7.6.15 to
support the e1000_dump code, which dumps the rings when the tx hang
occurs.  That will help us figure out a) what software did to the ring,
and b) whether something is messed up in the ring that we know will hang
the hardware.


>> Today for the first time (after applying options to e1000 driver in
>> modprobe.conf) I got a kernel panic:
>> 
>> BUG: unable to handle kernel paging request at virtual address a0379120
>> EIP: 0060: [<c05db2dc>] Not Tainted VLI
>> EIP is at ip_rcv+0x286/0x4ba
>> Kernel panic - not syncing: Fatal exception in interrupt
>> 
>> This is what I wrote down on paper because it wasn't logged anywhere.
>> Usually it hangs without a kernel panic.

Everyone involved will need more information about the panic to make
progress on it.

>> 
>> System is Fedora Core 8, up to date
>> 2.6.23.9-85.fc8PAE
>> 2x Intel(R) Xeon(TM) CPU 3.20GHz
>> 4G RAM
>> 
>> Without QoS loaded, the system never hangs, so the hang must be related
>> to it. However, the e1000 error I'm receiving must have to do with the
>> e1000 driver. I've seen this bug in the past, which is why I tried
>> applying the options in modprobe.conf.

>> modprobe.conf options for e1000:
>> options e1000 XsumRX=0 Speed=1000 Duplex=2 InterruptThrottleRate=0
>> FlowControl=3 RxDescriptors=4096 TxDescriptors=4096 RxIntDelay=0
>> TxIntDelay=0 

Please don't use any of these options unless you must.  They seem to have
come from some Debian forum where someone posted a SWAG at changing
parameters, which fixed the problem for him for some unknown reason.

Get back to us with the debug output; the e1000 issue can be covered on
e1000-devel@lists.sourceforge.net
Comment 3 Anonymous Emailer 2008-01-24 13:50:56 UTC
Reply-To: akpm@linux-foundation.org

> On Thu, 24 Jan 2008 12:06:58 -0800 "Brandeburg, Jesse"
> <jesse.brandeburg@intel.com> wrote:
> Would you be willing to try the 7.6.15 driver at e1000.sourceforge.net,
> it has many more fixes for e1000 than what is available in the in-kernel
> driver.  I just posted a patch in the "Tracker/Patches" area that
> patches 7.6.15 to support the e1000_dump code to dump rings when the tx
> hang occurs which will help us figure out a) what software did to the
> ring, b) if something is messed up in the ring which we know will hang
> the hardware

General hand-waving: if you have a debug patch which helps diagnose
rare problems when a tester hits them, please consider just putting
them straight into the mainline tree.

We can always take them out later when everything is resolved, and we
can help ourselves a lot this way. 
Comment 4 Kapetanakis Giannis 2008-01-25 01:37:47 UTC
(In reply to comment #2)
> Andrew Morton wrote:
> Would you be willing to try the 7.6.15 driver at e1000.sourceforge.net,
> it has many more fixes for e1000 than what is available in the in-kernel
> driver.  I just posted a patch in the "Tracker/Patches" area that
> patches 7.6.15 to support the e1000_dump code to dump rings when the tx
> hang occurs which will help us figure out a) what software did to the
> ring, b) if something is messed up in the ring which we know will hang
> the hardware

I've updated my system today to kernel 2.6.23.14-107.fc8PAE.
I've also installed the 7.6.15 driver from e1000.sourceforge.net,
including the dump patch (which, by the way, failed to apply to the SUM file;
I fixed that manually).

driver: e1000
version: 7.6.15-NAPI_debug
firmware-version: N/A
bus-info: 0000:04:02.0

If I have a new dump I will forward it.

> >> Today for the first time (after applying options to e1000 driver in
> >> modprobe.conf) I got a kernel panic:
> >> 
> >> BUG: unable to handle kernel paging request at virtual address a0379120
> >> EIP: 0060: [<c05db2dc>] Not Tainted VLI
> >> EIP is at ip_rcv+0x286/0x4ba
> >> Kernel panic - not syncing: Fatal exception in interrupt
> >> 
> >> This is what I wrote down on paper because it wasn't logged anywhere.
> >> Usually it hangs without a kernel panic.
> 
> Everyone involved will need more information about the panic to make
> progress on the panic.

Sure, but how? The panic is not logged anywhere and the system is frozen
when it happens. How could I capture the panic output?

> >> modprobe.conf options for e1000:
> >> options e1000 XsumRX=0 Speed=1000 Duplex=2 InterruptThrottleRate=0
> >> FlowControl=3 RxDescriptors=4096 TxDescriptors=4096 RxIntDelay=0
> >> TxIntDelay=0 
> 
> Please don't use any of these options unless you must, they seem to have
> come from some debian forum that someone just posted a SWAG at changing
> parameters that fixed him for some unknown reason.
> 
> Get back to us with the debug output, the e1000 issue can be covered on
> e1000-devel@lists.sourceforge.net

I removed the options.

thanks for helping
Giannis
Comment 5 Kapetanakis Giannis 2008-01-25 04:41:26 UTC
Created attachment 14572
e1000-dump

I just got a new Tx Unit Hang.
The dump is included in the attachment.

Giannis
Comment 6 Kapetanakis Giannis 2008-01-25 04:56:41 UTC
Created attachment 14573
e1000_2_dump

And a new one, right after I started/stopped
my QoS services.
Comment 7 Jesse Brandeburg 2008-02-01 09:37:08 UTC
This user was able to eliminate the tx hang problem with htb QoS by
disabling TSO.  We also have some reports of Xen hangs when TSO is
enabled.  Not sure if they are related.

At this point this user has a workaround.  We will continue to try to
reproduce here to see if we can root cause the hardware issue, but we
haven't had any luck so far.
Comment 8 Jeremy Fitzhardinge 2008-02-01 11:20:32 UTC
I've had reports of device hangs under Xen due to lost events.  There's a fix in 2.6.24, but one reporter says it didn't help him, so that problem is still open.  I'd attribute device-related hangs to that rather than something TSO specific.
Comment 9 Leandro Silva 2008-07-22 13:31:09 UTC
Hello!
I have the same tc hanging problem on several servers. Sometimes it takes several days, sometimes only minutes, for the system to hang. When I disable QoS it keeps running. I am mainly running kernel 2.6.23 or 2.6.25 on Mandriva 2008.0 and 2007.1.
Any news about this problem?
Thanks,
Leandro
Comment 10 Kapetanakis Giannis 2008-07-23 02:18:03 UTC
Hi, 

By the way, since I disabled TSO I've never had the problem again:

ethtool -k eth0
ethtool -K eth0 tso off

regards,

Giannis
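
For anyone applying the same workaround: `ethtool -k` shows the offload settings and `ethtool -K` changes them, and the change does not persist across reboots, so it has to be reapplied at boot. A minimal sketch that turns TSO off only when it is currently on is given below. The function names are hypothetical, and the label matching assumes `ethtool -k` prints a line such as "tcp segmentation offload: on" or "tcp-segmentation-offload: on", which varies between ethtool versions.

```shell
#!/bin/sh
# tso_state
# Reads `ethtool -k` output on stdin and prints "on" or "off" for TCP
# segmentation offload. Prints nothing if no matching line is found.
# ASSUMPTION: the label is spelled with either spaces or hyphens.
tso_state() {
    awk -F': *' '/^[ \t]*tcp[- ]segmentation[- ]offload:/ {
        split($2, w, " ")   # drop trailing notes such as "[fixed]"
        print w[1]
        exit
    }'
}

# disable_tso_if_on IFACE
# Turns TSO off on IFACE only if ethtool currently reports it as on.
disable_tso_if_on() {
    if [ "$(ethtool -k "$1" 2>/dev/null | tso_state)" = "on" ]; then
        ethtool -K "$1" tso off
    fi
}

# Example call (needs root and ethtool), e.g. from a boot script:
#   disable_tso_if_on eth0
```

Parsing is kept separate from the ethtool calls so that the matching can be checked against saved `ethtool -k` output without touching a live interface.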
Comment 11 Jesse Brandeburg 2009-03-12 13:13:36 UTC
hm, just stumbled upon this again.
Comment 12 Jesse Brandeburg 2009-03-12 13:14:47 UTC
I have a theory that we're getting too much data in a single TSO burst and filling the tx fifo, which can cause our hardware to hang.  Still working on code to verify that this is what is actually happening.
Comment 13 Kapetanakis Giannis 2010-01-13 10:40:31 UTC
I had a Tx Unit Hang on this same machine with TSO disabled.
System is Redhat 5.4, kernel 2.6.18-164.10.1.el5PAE

See my post at:
https://bugzilla.redhat.com/show_bug.cgi?id=436966#c45

Best regards,

Giannis
Comment 14 Alan 2012-05-17 15:29:39 UTC
Status ?
Comment 15 Kapetanakis Giannis 2012-05-17 15:55:39 UTC
Well, the machine has been replaced.

In addition, I'm not using any htb right now, so I can't really tell.
Comment 16 Alan 2012-05-21 14:51:28 UTC
If you (or anyone else finding this bug) do find it present on a recent kernel (3.2+ at the moment), please re-open. Thanks.

For now I'll mark it obsolete
