Bug 8638

Summary:	unregister_netdevice: waiting for ppp0 to become free. pppoe + multihome + htb qos?
Product:	Networking	Reporter:	Trevor Cordes (kernelbugs)
Component:	Netfilter/Iptables	Assignee:	Stephen Hemminger (stephen)
Status:	RESOLVED DUPLICATE
Severity:	high
Priority:	P1
Hardware:	All
OS:	Linux
Kernel Version:	2.6.20-1.2316.fc5	Subsystem:
Regression:	---	Bisected commit-id:
Attachments:	annotated log excerpts showing 4 instances of bug hitting

Description Trevor Cordes 2007-06-16 03:14:41 UTC

Most recent kernel where this bug did not occur: unknown. It has occurred since at least 2.6.18-1.2200.fc5 (Sep 2005), but could have been present in earlier versions, as I wasn't then using the technology I believe triggers the bug.
Distribution: FC5
Hardware Environment: x86 P4 UP 512MB
Software Environment: lots of cutting-edge (but stock kernel) networking technology
Problem Description:

Every few months on 1 box I administer:
kernel: unregister_netdevice: waiting for ppp0 to become free. Usage count = 1
The system gets very locked up (though often not completely, and with no panics) and won't reboot: it requires an onsite hard reset. In fact, most reboot attempts fail even when the bug has not yet hit, because the reboot itself triggers the bug. I now always reboot the box with reboot -f when I'm remote.

I have a dozen extremely similar boxes to this buggy one out there and they don't show this bug. Unique to this box and I think relevant to the bug:

1) 2 PPPoE DSL connections (multihomed, 2 IP addresses, traffic split by port, used to achieve higher aggregate upload bandwidth)
2) multi-table ip route rules ("ip rule add ... table 2") to achieve traffic splitting in #1.

Other technologies combined on this box but not on any others (though others use them separately without the bug hitting):

3) QoS, HTB qdiscs (used on non-PPPoE boxes without the bug)
4) 2.6sec IPSEC VPN (used on many other PPPoE and non-PPPoE boxes without problems)
5) PPPoE (used on many other boxes without this bug)

I'm not even sure where to begin on what info to provide. I can provide my config for any of the above technologies if it will help. The box is an important production box, and unless I can find a way to reliably make it barf while I'm onsite, it may be hard to test things like "turn off QoS", because all the technologies are essential for day-to-day operations.

I'll attach a useful log excerpt from the last 4 times the bug hit if I can.

If this is a bad bug entry, please tell me what I need to add. It's my first entry on this bugzilla and I'm not sure what's required. I'm sorry this bug report is on the FC5 stock kernels, but I'm not sure I can use a "vanilla" kernel instead of FC5 and not screw something up. However, there are NO binary modules or any weird stuff on the box. It's all stock FC5 rpms.

This box is a production box and the only one I have with 2 PPPoE connections to test. I'm nearly positive it's either a 2-PPPoE+advanced-routing problem or a 2-PPPoE+HTB problem. Since I've seen no other hits on google or elsewhere that are exactly like this bug, I must assume it's something fairly unique to this box: but what combination?!

I've had a Redhat bugzilla open on this since Sep 2005 with zero replies! It shows more detail and my thought process over the years.
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=169502

Steps to reproduce:
Haven't figured out a way to reliably hit this bug. Any hints to allow easier testing (which must be done onsite) are welcome.

Comment 1 Trevor Cordes 2007-06-16 03:15:53 UTC

Created attachment 11764 [details]
annotated log excerpts showing 4 instances of bug hitting

Comment 2 Trevor Cordes 2007-06-16 03:19:35 UTC

Darn.  I forgot the most important technology on the box:

6) iptables: over 400 rules.  The same rules are used on my other boxes without problems, but this box has special rules for the multihoming and QoS.  There are no -j QUEUE rules, which some other Google hits indicate may cause bugs like this.

Comment 3 Trevor Cordes 2007-06-16 03:24:52 UTC

http://groups.google.com/group/linux.debian.kernel/browse_thread/thread/dcb36b5fe827fad6/05445a30be147608

indicates the bug may indeed be QoS-related.  If they fixed it in CBQ qdisc, perhaps the same bug is in HTB qdisc but not yet fixed because HTB is relatively obscure compared to CBQ?  I don't use CBQ because it is overly complex for what I want to do and HTB is way easier to wrap your head around.

If someone knows how to easily convert my simple HTB setup to CBQ I suppose running that for a few days/months would be a good test.  Doesn't solve the bug though!

Comment 4 Andrew Morton 2007-06-16 08:38:44 UTC

Subject: Re: [Bugme-new]  New: unregister_netdevice: waiting for
 ppp0 to become free. pppoe + multihome + htb qos?

On Sat, 16 Jun 2007 03:11:30 -0700 (PDT) bugme-daemon@bugzilla.kernel.org wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=8638
> 
>            Summary: unregister_netdevice: waiting for ppp0 to become free.
>                     pppoe + multihome + htb qos?
>            Product: Networking
>            Version: 2.5
>      KernelVersion: 2.6.20-1.2316.fc5
>           Platform: All
>         OS/Version: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: high
>           Priority: P1
>          Component: Netfilter/Iptables
>         AssignedTo: networking_netfilter-iptables@kernel-bugs.osdl.org
>         ReportedBy: kernelbugs@tecnopolis.ca
> 

I have a vague feeling that we fixed this in a later kernel.  Does anyone
recall?

Thanks.

Comment 5 Stephen Hemminger 2007-06-18 08:22:29 UTC

Subject: Re: [Bugme-new]  New: unregister_netdevice: waiting for
 ppp0 to become free. pppoe + multihome + htb qos?

On Mon, 18 Jun 2007 10:56:06 -0400
Chuck Ebbert <cebbert@redhat.com> wrote:

> 
> Is there any way to print the addresses the notifier is calling
> to try and release net device references? I see:
> 
> net/core/dev.c::netdev_wait_allrefs():
> 
>         while (atomic_read(&dev->refcnt) != 0) {
>                 if (time_after(jiffies, rebroadcast_time + 1 * HZ)) {
>                         rtnl_lock();
> 
>                         /* Rebroadcast unregister notification */
>                         raw_notifier_call_chain(&netdev_chain,
>                                             NETDEV_UNREGISTER, dev);
> 
> but don't see any way to print the functions that get called.

You could walk the chain and print the functions out, but it wouldn't
really help identify the problem. The problem is when a protocol forgets
to call dev_put() after calling dev_hold().  The notifier there is
just a last effort at beating a dead horse. It really should be removed
since it never helps.  The notifier in unregister does work, and calling
the notification repeatedly doesn't change anything.
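
To make the failure mode concrete, here is a minimal userspace sketch (not kernel code) of an unbalanced hold/put pair. The toy_* functions and struct toy_netdev are hypothetical stand-ins for dev_hold(), dev_put(), and struct net_device, and the "forgotten put on an error path" is only an assumed example of how a protocol might leak a reference; it is not a diagnosis of this bug.

```c
#include <assert.h>
#include <stdio.h>

/* Toy stand-in for struct net_device: only a name and a reference count. */
struct toy_netdev {
    const char *name;
    int refcnt;
};

/* Userspace analogues of dev_hold() and dev_put() (hypothetical names). */
static void toy_dev_hold(struct toy_netdev *dev)
{
    dev->refcnt++;
}

static void toy_dev_put(struct toy_netdev *dev)
{
    assert(dev->refcnt > 0);    /* a put without a matching hold is a bug */
    dev->refcnt--;
}

/* Well-behaved user: every hold is paired with a put. */
static void balanced_user(struct toy_netdev *dev)
{
    toy_dev_hold(dev);
    /* ... use the device ... */
    toy_dev_put(dev);
}

/* Buggy user: the early-return error path skips the put. */
static int leaky_user(struct toy_netdev *dev, int fail)
{
    toy_dev_hold(dev);
    if (fail)
        return -1;              /* BUG: the reference is never dropped */
    toy_dev_put(dev);
    return 0;
}

/*
 * Toy model of the wait loop: poll up to max_polls times.
 * Returns 0 once the count reaches zero, -1 if it never does.
 * Re-sending a notification inside this loop would not change refcnt,
 * because the leaked reference has no owner left to drop it.
 */
static int toy_wait_allrefs(const struct toy_netdev *dev, int max_polls)
{
    for (int i = 0; i < max_polls; i++) {
        if (dev->refcnt == 0)
            return 0;
        printf("unregister_netdevice: waiting for %s to become free. "
               "Usage count = %d\n", dev->name, dev->refcnt);
    }
    return dev->refcnt == 0 ? 0 : -1;
}
```

After one call to leaky_user() with fail set, the toy device's count stays at 1 and toy_wait_allrefs() can only keep printing the message, which mirrors the symptom in the report.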

Comment 6 Andrew Morton 2007-06-18 08:27:45 UTC

Subject: Re: [Bugme-new]  New: unregister_netdevice: waiting for
 ppp0 to become free. pppoe + multihome + htb qos?

On Mon, 18 Jun 2007 10:56:06 -0400 Chuck Ebbert <cebbert@redhat.com> wrote:

> 
> Is there any way to print the addresses the notifier is calling
> to try and release net device references? I see:
> 
> net/core/dev.c::netdev_wait_allrefs():
> 
>         while (atomic_read(&dev->refcnt) != 0) {
>                 if (time_after(jiffies, rebroadcast_time + 1 * HZ)) {
>                         rtnl_lock();
> 
>                         /* Rebroadcast unregister notification */
>                         raw_notifier_call_chain(&netdev_chain,
>                                             NETDEV_UNREGISTER, dev);
> 
> but don't see any way to print the functions that get called.

Nope.  I guess we could add some print_notifier_call_chain() thing, but
then we'd need one flavour per locking scheme and it would get ridiculous.

I guess just an unlocked version would be OK - it's just a debug thing.
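
As a rough illustration of the unlocked debug walk, here is a userspace sketch over a singly linked chain. Everything named toy_* is hypothetical, including the label field, which stands in for whatever symbol lookup a real kernel helper would use; no such print_notifier_call_chain() function is assumed to exist.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Toy model of a notifier chain entry: a callback plus a next pointer.
 * The label field is an assumption added for this sketch so that the
 * walker has something printable without kernel symbol lookup.
 */
struct toy_notifier_block {
    int (*notifier_call)(struct toy_notifier_block *nb,
                         unsigned long event, void *data);
    struct toy_notifier_block *next;
    const char *label;
};

/*
 * Walk the chain without taking any lock and print each entry.
 * Returns the number of entries seen. Debug use only: with no locking,
 * the chain must not be modified while this runs.
 */
static int toy_print_notifier_chain(const struct toy_notifier_block *head)
{
    int n = 0;

    for (const struct toy_notifier_block *nb = head; nb != NULL; nb = nb->next) {
        assert(nb->notifier_call != NULL);  /* every entry needs a callback */
        printf("notifier[%d]: %s\n", n, nb->label ? nb->label : "(unnamed)");
        n++;
    }
    return n;
}

/* Two placeholder callbacks so a chain can be built for the demo. */
static int toy_handler_a(struct toy_notifier_block *nb,
                         unsigned long event, void *data)
{
    (void)nb; (void)event; (void)data;
    return 0;
}

static int toy_handler_b(struct toy_notifier_block *nb,
                         unsigned long event, void *data)
{
    (void)nb; (void)event; (void)data;
    return 0;
}
```

Building a two-entry chain (a pointing to b) and passing its head to toy_print_notifier_chain() lists both entries in call order. As noted in comment 5, such a listing shows who is notified, not who still holds the reference.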

Comment 7 Stephen Hemminger 2007-09-05 06:56:56 UTC

This is a similar match to an existing bug.

*** This bug has been marked as a duplicate of bug 6197 ***