Bug 41152

Summary: kernel 3.0 and above fails to handle vlan id 0 (802.1p) packets properly without hardware acceleration
Product: Networking
Reporter: Mike Auty (mike.auty)
Component: Other
Assignee: Arnaldo Carvalho de Melo (acme)
Status: CLOSED CODE_FIX
Severity: normal
CC: florian, maciej.rutecki, rjw
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 3.0
Subsystem:
Regression: Yes
Bisected commit-id:
Bug Depends on:
Bug Blocks: 36912

Description Mike Auty 2011-08-14 12:48:14 UTC
Hi there,

I recently found that packets tagged with a vlan id of 0 are no longer received on the main interface.  There were no dmesg entries for the dropped packets.  I attempted to set up a vlan 0 interface and configure it, but couldn't successfully route traffic to the device.  I can recreate this on two of the three networking devices I have; my guess is that the third successfully handles vlan tags through hardware acceleration.

After a bisection this appears to be related to commit bcc6d47903612c3861201cc3a866fb604f26b8b2, which seems to try to merge the non-hardware accelerated and hardware accelerated code paths for handling vlans.  In the process, it appears vlan id 0 (802.1p) packets are no longer handled correctly.

Unfortunately I don't know the code paths well enough to figure out what's going wrong, but I'd be happy to provide more information, run tests or try out patches if it would help, just let me know.  Thanks...  5:)

Mike  5:)
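
For readers unfamiliar with the tag in question: an 802.1Q tag is four bytes, the TPID 0x8100 followed by a 16-bit TCI that holds a 3-bit priority (PCP), a 1-bit DEI/CFI flag and a 12-bit VLAN ID. A VID of 0 marks a priority-tagged (802.1p) frame, which carries a priority but belongs to no VLAN. The Python sketch below is illustrative only. It is not kernel code, the function names and the "eth0" device name are made up for this note, and it merely models the behaviour the report expects, where VID 0 frames are delivered to the main interface.

```python
import struct

ETH_P_8021Q = 0x8100  # TPID that marks an 802.1Q-tagged frame


def parse_vlan_tag(frame: bytes):
    """Return (pcp, dei, vid) for an 802.1Q-tagged frame, or None if untagged."""
    if len(frame) < 14:
        return None
    (ethertype,) = struct.unpack("!H", frame[12:14])
    if ethertype != ETH_P_8021Q or len(frame) < 18:
        return None
    (tci,) = struct.unpack("!H", frame[14:16])
    pcp = tci >> 13          # top 3 bits: 802.1p priority
    dei = (tci >> 12) & 0x1  # 1 bit: drop eligible / CFI
    vid = tci & 0x0FFF       # low 12 bits: VLAN ID
    return (pcp, dei, vid)


def expected_receiver(frame: bytes, base: str = "eth0") -> str:
    """Hypothetical model: name the device that should see the frame.

    Untagged and priority-tagged (VID 0) frames go to the base device;
    any other VID goes to the matching VLAN device, e.g. "eth0.100".
    """
    tag = parse_vlan_tag(frame)
    if tag is None or tag[2] == 0:
        return base
    return f"{base}.{tag[2]}"
```

With a TCI of 0xa000 (priority 5, VID 0) the model returns the base device, which is the behaviour that 2.6.39 showed and 3.0 lost according to this report.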
Comment 1 Andrew Morton 2011-08-16 22:09:53 UTC
(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

Comment 2 Jiri Pirko 2011-08-17 06:21:44 UTC

Hi Mike. May I ask which NIC you are seeing the regression on?
It may have something to do with dev->vlangrp and ndo_vlan_add/kill_vid.
VID 0 was recently only added by the latter ones. So if a driver only
depended on dev->vlangrp, 0 was not there. This was changed recently by
my "vlan cleanup" patches. It may work for you again on net-next today.


Jirka
Comment 3 Mike Auty 2011-08-17 08:03:12 UTC
On 17/08/11 06:37, Jiri Pirko wrote:
> 
> Hi Mike. May I ask what NIC are you seeing the regression on?
> It may have something to do with dev->vlangrp and ndo_vlan_add/kill_vid.
> VID 0 was recently only added by the latter ones. So if driver only
> depended on dev->vlangrp, 0 was not there. This was changed recently by
> my "vlan cleanup" patches. It may work for you again on net-next today.
> 

Hi there,

I'm finding it on the following two NICs:

02:00.0 Ethernet controller: Attansic Technology Corp. L1 Gigabit
Ethernet Adapter (rev b0)

05:01.0 Network controller [0280]: Broadcom Corporation BCM4306
802.11b/g Wireless LAN Controller [14e4:4320] (rev 03)

and (on a different machine):

02:00.0 Network controller [0280]: Intel Corporation WiFi Link 6000
Series [8086:422c] (rev 35)

The only one I have had any success with is:

00:19.0 Ethernet controller [0200]: Intel Corporation 82577LM Gigabit
Network Connection [8086:10ea] (rev 05)

Which I assume is because it has actual hardware acceleration.  I may be
able to put net-next on the Intel Wifi box for testing at some point,
but I don't know how soon I'll be able to do that.  Please let me know
whether that would be a worthwhile test, and if so I'll try and get it
done.  Thanks...

Mike  5:)
Comment 4 Jiri Pirko 2011-08-17 10:59:55 UTC
Wed, Aug 17, 2011 at 08:36:15AM CEST, mike.auty@gmail.com wrote:
>02:00.0 Network controller [0280]: Intel Corporation WiFi Link 6000
>Series [8086:422c] (rev 35)

I just obtained a very similar card (8086:422b). Going to look at it right
away.

One more thing. What do you use to generate vlan0 tagged packets? I'm
using pktgen with "vlan_id 0". Would you please check whether it behaves
the same for you?
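
As an aside for anyone without pktgen at hand, a VID 0 frame can also be built by hand. The Python sketch below is an assumption-laden alternative, not the pktgen setup used in this thread: the function name and defaults are hypothetical, and it only constructs the frame bytes without sending them.

```python
import struct

ETH_P_8021Q = 0x8100  # 802.1Q TPID
ETH_P_IP = 0x0800     # inner EtherType used as the default here


def build_tagged_frame(dst_mac: bytes, src_mac: bytes, payload: bytes,
                       vid: int = 0, pcp: int = 0,
                       ethertype: int = ETH_P_IP) -> bytes:
    """Build an Ethernet frame with one 802.1Q tag (VID 0 by default)."""
    if len(dst_mac) != 6 or len(src_mac) != 6:
        raise ValueError("MAC addresses must be 6 bytes each")
    if not 0 <= vid <= 0x0FFF:
        raise ValueError("VLAN ID must fit in 12 bits")
    if not 0 <= pcp <= 7:
        raise ValueError("priority must fit in 3 bits")
    tci = (pcp << 13) | vid  # DEI/CFI bit left at 0
    header = dst_mac + src_mac + struct.pack("!HHH", ETH_P_8021Q, tci, ethertype)
    # Pad to the 60-byte minimum frame length (FCS not included).
    return (header + payload).ljust(60, b"\x00")
```

Actually transmitting the result (for example through a Linux AF_PACKET raw socket, which needs root) is left out, and whether the tag reaches the wire unchanged may depend on the sending NIC and driver.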
	
Comment 5 Jiri Pirko 2011-08-17 17:50:52 UTC
Wed, Aug 17, 2011 at 12:59:51PM CEST, jpirko@redhat.com wrote:
>One more thing. What do you use to generate vlan0 tagged packets? I'm
>using pktgen with "vlan_id 0". Would you please try that it behaves the
>same for you?

I'm using the following pktgen script:
http://pastebin.com/E3f4R8XY

On the rx side, with the Intel wireless card, I use the following stap
script to observe incoming packets:
http://pastebin.com/VeXLhauu

All is looking good there. What do you use to look at incoming packets?

Jirka

Comment 6 Mike Auty 2011-08-17 22:48:48 UTC
On 17/08/11 11:59, Jiri Pirko wrote:
> 
> I just obtained very similar card (8086:422b). Going to look at it right
> away.
> 
> One more thing. What do you use to generate vlan0 tagged packets? I'm
> using pktgen with "vlan_id 0". Would you please try that it behaves the
> same for you?
>       

Sorry, I haven't been using pktgen.  I've got an actual device (a
Samsung android phone) which seems to tag all normal outbound packets
with this type of vlan tag.  I only discovered a month ago that I needed
the 8021q module to be able to talk to it, and then suddenly it stopped
working once I moved to the 3.0 kernel.

I might not have made it clear, but the packets are received (inasmuch
as the packet is definitely sent, and it's seen by tools such as
wireshark), but no reply is ever sent.  I've attached packet logs from
the 3.0.1 kernel and the 2.6.39.3 kernel.  Oddly the tagging only seems
to be used on the first SYN,ACK packet, but again I don't know enough
about the pipeline or what the Samsung kernel's doing to cause that.

I hope that's of some help?  I may be able to get systemtap support
rolled into my kernel tomorrow at some point, but if not then it will
have to wait until the weekend.  I don't know if that will provide
useful information for debugging this, but I am happy to run whatever
tests I can to figure this out...

Mike  5:)
Comment 7 Jiri Pirko 2011-08-18 16:37:09 UTC
Patch posted:
http://patchwork.ozlabs.org/patch/110535/

Sorry, I forgot to cc you, Mike. Thanks a lot for the report!

Jirka
Comment 8 Mike Auty 2011-08-18 19:39:23 UTC
On 18/08/11 17:37, Jiri Pirko wrote:
> 
> Patch posted:
> http://patchwork.ozlabs.org/patch/110535/
> 
> sorry I forgot to cc you Mike. Thanks a lot for report!

No problem,

Thanks very much for the speedy fix!  I've applied the patch and can
confirm it solves my problem.  I look forward to seeing it hit the
mainline...  5:)

Mike  5:)
Comment 9 Florian Mickler 2011-08-19 08:45:17 UTC
Patch: http://patchwork.ozlabs.org/patch/110535/
Comment 10 Rafael J. Wysocki 2011-08-28 19:22:48 UTC
Fixed by commit c5114cd59d2664f258b0d021d79b1532d94bdc2b .