Bug 9225 - (net typhoon) "no descs for cmd, had (needed) 0 (1) cmd, 31 (7) resp"
Summary: (net typhoon) "no descs for cmd, had (needed) 0 (1) cmd, 31 (7) resp"
Status: REJECTED INSUFFICIENT_DATA
Alias: None
Product: Drivers
Classification: Unclassified
Component: Network
Hardware: All Linux
Importance: P1 high
Assignee: Jeff Garzik
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2007-10-25 06:45 UTC by Dmitry Sanarin
Modified: 2014-04-24 21:03 UTC
CC List: 2 users

See Also:
Kernel Version: 2.6.23.1 SMP PREEMPT
Subsystem:
Regression: ---
Bisected commit-id:


Attachments
Kernel panic photo (94.47 KB, image/jpeg)
2007-10-25 06:46 UTC, Dmitry Sanarin
full dmesg output (33.48 KB, text/plain)
2007-12-11 08:26 UTC, Dmitry Sanarin
full dmesg output then cards swapped (34.27 KB, text/plain)
2007-12-17 08:36 UTC, Dmitry Sanarin
Our machines photo (53.43 KB, image/jpeg)
2007-12-17 08:53 UTC, Dmitry Sanarin
lspci -xxx output (22.55 KB, text/plain)
2007-12-24 03:17 UTC, Dmitry Sanarin
lspci -xxx output (22.55 KB, text/plain)
2007-12-24 03:53 UTC, Dmitry Sanarin
dmesg output with the debug message (50.50 KB, text/plain)
2014-04-23 23:07 UTC, Lazar Krumov
fresh lshw output at the moment with the debug message (21.56 KB, text/plain)
2014-04-23 23:09 UTC, Lazar Krumov
fresh lspci -v output at the moment with the debug message (9.82 KB, text/plain)
2014-04-23 23:09 UTC, Lazar Krumov

Description Dmitry Sanarin 2007-10-25 06:45:41 UTC
Most recent kernel where this bug did not occur: none
Distribution: Fedora Core 4
Hardware Environment: Intel 845 chipset, 3 x 3C990-FX-97 NICs
Software Environment:
Problem Description:
Our dmesg buffer fills up with these messages:

eth1: no descs for cmd, had (needed) 0 (1) cmd, 31 (7) resp
eth1: error getting stats
eth1: no descs for cmd, had (needed) 0 (1) cmd, 31 (7) resp
eth1: error getting stats
eth1: no descs for cmd, had (needed) 0 (1) cmd, 31 (7) resp
eth1: error getting stats

and so on.

After days of running, we got a kernel panic.

Steps to reproduce:
Comment 1 Dmitry Sanarin 2007-10-25 06:46:39 UTC
Created attachment 13279 [details]
Kernel panic photo
Comment 2 Dmitry Sanarin 2007-10-25 06:50:08 UTC
lspci

00:00.0 Host bridge: Intel Corporation 945G/GZ/P/PL Express Memory Controller Hub (rev 02)
00:01.0 PCI bridge: Intel Corporation 945G/GZ/P/PL Express PCI Express Root Port (rev 02)
00:1c.0 PCI bridge: Intel Corporation 82801G (ICH7 Family) PCI Express Port 1 (rev 01)
00:1c.1 PCI bridge: Intel Corporation 82801G (ICH7 Family) PCI Express Port 2 (rev 01)
00:1c.2 PCI bridge: Intel Corporation 82801G (ICH7 Family) PCI Express Port 3 (rev 01)
00:1c.3 PCI bridge: Intel Corporation 82801G (ICH7 Family) PCI Express Port 4 (rev 01)
00:1d.0 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI #1 (rev 01)
00:1d.1 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI #2 (rev 01)
00:1d.2 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI #3 (rev 01)
00:1d.3 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI #4 (rev 01)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev e1)
00:1f.0 ISA bridge: Intel Corporation 82801GB/GR (ICH7 Family) LPC Interface Bridge (rev 01)
00:1f.2 IDE interface: Intel Corporation 82801GB/GR/GH (ICH7 Family) Serial ATA Storage Controller IDE (rev 01)
00:1f.3 SMBus: Intel Corporation 82801G (ICH7 Family) SMBus Controller (rev 01)
01:00.0 VGA compatible controller: nVidia Corporation NV44 [Quadro NVS 285] (rev a1)
03:00.0 PCI bridge: PLX Technology, Inc. PEX 8111 PCI Express-to-PCI Bridge (rev 21)
05:00.0 VGA compatible controller: nVidia Corporation NV44 [Quadro NVS 285] (rev a1)
06:00.0 VGA compatible controller: nVidia Corporation NV44 [Quadro NVS 285] (rev a1)
07:0c.0 Ethernet controller: 3Com Corporation 3C990B-TX-M/3C990BSVR [Typhoon2] (rev 03)
07:0d.0 Ethernet controller: 3Com Corporation 3C990B-TX-M/3C990BSVR [Typhoon2] (rev 03)
07:0e.0 Ethernet controller: 3Com Corporation 3C990B-TX-M/3C990BSVR [Typhoon2] (rev 03)
07:0f.0 Serial controller: Moxa Technologies Co Ltd Intellio C218 Turbo PCI (rev 02)
Comment 3 Anonymous Emailer 2007-10-25 13:20:19 UTC
Reply-To: akpm@linux-foundation.org

On Thu, 25 Oct 2007 06:45:42 -0700 (PDT)
bugme-daemon@bugzilla.kernel.org wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=9225
> 
>            Summary: typhoon : "no descs for cmd, had (needed) 0 (1) cmd, 31
>                     (7) resp"
>            Product: Drivers
>            Version: 2.5
>      KernelVersion: 2.6.23.1 SMP PREEMPT
>           Platform: All
>         OS/Version: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: high
>           Priority: P1
>          Component: Network
>         AssignedTo: jgarzik@pobox.com
>         ReportedBy: sanarin@nikiet.ru
> 
> 
> Most recent kernel where this bug did not occur: none
> Distribution: Fedora Core 4
> Hardware Environment: Intel 845 chipset, 3 x 3C990-FX-97 NICs
> Software Environment:
> Problem Description:
>  Our dmesg buffer fill of these messages
> 
> eth1: no descs for cmd, had (needed) 0 (1) cmd, 31 (7) resp
> eth1: error getting stats
> eth1: no descs for cmd, had (needed) 0 (1) cmd, 31 (7) resp
> eth1: error getting stats
> eth1: no descs for cmd, had (needed) 0 (1) cmd, 31 (7) resp
> eth1: error getting stats
> 
> and so one.
> 
>  After for days of working we got kernel panic.
> 
> Steps to reproduce:

(switching to email - please respond via reply-to-all, not via the bugzilla
web interface)

typhoon.c doesn't appear to have a maintainer.  

It oopsed in typhoon_rx - there's a jpg attached to the bugzilla report.
Comment 4 David Dillow 2007-12-04 12:10:31 UTC
I am the maintainer, and I'm listed in the file, but this is the first time I've heard of this. I must have missed it when it came through netdev.

How were you generating the "no descs" error? Did you have something banging away at it trying to get stats? I ask as I don't recall that the driver uses any command descriptors once it is up, unless it is retrieving stats. But it has been a while since I've been in the source code.

As for the oops, I'll have to think about how that's happening.
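
For reference, the message reads as free (needed) descriptor counts: 0 free command descriptors where 1 was needed, and 31 free response descriptors where 7 were needed. Below is a minimal user-space sketch of that kind of ring accounting. It assumes a ring that always keeps one slot empty to tell "full" from "empty"; `ring_free_slots` and `can_issue` are hypothetical names for illustration, and this is not the driver's actual code.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical model, not the typhoon driver's code.
 * Free slots in a ring of ring_size entries, where one slot is always
 * kept empty so that "full" and "empty" can be told apart.
 * last_write and last_read are slot indices in [0, ring_size).
 */
static int ring_free_slots(int last_write, int last_read, int ring_size)
{
    assert(ring_size > 1);
    assert(last_write >= 0 && last_write < ring_size);
    assert(last_read >= 0 && last_read < ring_size);
    return (ring_size + last_read - last_write - 1) % ring_size;
}

/*
 * Returns 0 when both rings have enough room, -1 otherwise.
 * On failure it prints a line shaped like the one in the logs.
 */
static int can_issue(int free_cmd, int need_cmd, int free_resp, int need_resp)
{
    if (free_cmd < need_cmd || free_resp < need_resp) {
        fprintf(stderr,
                "no descs for cmd, had (needed) %d (%d) cmd, %d (%d) resp\n",
                free_cmd, need_cmd, free_resp, need_resp);
        return -1;
    }
    return 0;
}
```

Under this model an idle 32-entry ring reports 31 free slots, which would match the "31" in the message if the response ring has 32 entries (an assumption). A ring whose write index sits one slot behind the read index reports 0 free slots, so a command ring that is never drained would keep producing the message.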
Comment 5 David Dillow 2007-12-04 12:11:24 UTC
Also, please attach a full dmesg from a boot. It'd be nice if it showed the error messages, but that is not necessary.
Comment 6 David Dillow 2007-12-07 13:21:09 UTC
Couple of other questions -- was it only doing this on eth1? You have two other typhoons, so I wonder about a bad/stuck card.

Doesn't explain the oops, though. Please send me/attach the typhoon.ko you were using so that I can decode where it is dying in the function.

What were you doing when it oopsed? Any VLANs in the configuration?
Comment 7 David Dillow 2007-12-07 14:03:34 UTC
Never mind the typhoon.ko -- it looks like it's oopsing at skb_copy_to_linear_data() in typhoon_rx(), but I'm not sure why skb would be NULL...

Will keep looking...
Comment 8 Dmitry Sanarin 2007-12-10 05:45:25 UTC
Our cards are not bad/stuck. We have 60 of those cards in 20 machines, and all of the hardware behaves badly. I'll send the full dmesg output and other details tomorrow.
Comment 9 David Dillow 2007-12-10 06:27:28 UTC
So, they continue to send packets while this is happening?

When sending the dmesg, please make sure to get from boot until the first time this happens.

Thank you for following up.
Comment 10 Dmitry Sanarin 2007-12-11 08:26:44 UTC
Created attachment 13974 [details]
full dmesg output
Comment 11 Dmitry Sanarin 2007-12-11 08:39:47 UTC
The cards work fine while we are getting the errors. They receive and send at good speed. But after a few days of running like this, we usually get a kernel panic.
Comment 12 David Dillow 2007-12-12 21:57:27 UTC
The only one having an error in the dmesg is eth2 -- and it looks like it is having a problem from the get-go. Is it still passing traffic at the end of this? That's a separate data path than the commands, but still would be useful to double check.

You mentioned that you have 20 machines doing this -- are they identical? Is it always the card in the same physical slot that gives the issue -- slot 06:0e? I'm not sure how that maps to the physical slot. Do the other typhoon cards in this machine eventually show the same problem, or are they good until you get a panic?

Can you swap the location of two cards in this same machine and post a dmesg of the problem happening?

I'm still leaning towards some hardware interaction that I'd like to understand, or rule out, but I'm concerned about the oops, and now that I have part of my test network back together, I'm seeing an occasional issue when shutting down the card on my desktop machine, but not every time.

Also, it looks like you are using Fedora 8 with the SMP kernel. If I send you a typhoon.ko with extra debugging, will you be able to run it for me? Or will it be better to send source and have you compile it?

Thank you for sticking with this; I'd like to run it down.
Comment 13 Dmitry Sanarin 2007-12-17 08:36:36 UTC
Created attachment 14083 [details]
full dmesg output then cards swapped

 I have swapped cards.
Comment 14 Dmitry Sanarin 2007-12-17 08:48:03 UTC
 I can test your typhoon.ko with extra debugging. You can send me source or typhoon.ko module.
Comment 15 Dmitry Sanarin 2007-12-17 08:53:57 UTC
Created attachment 14084 [details]
Our machines photo

Our machines are almost identical. These are industrial PICMG 1.3 machines, consisting of a motherboard and a backplane.
Some of them have a Pentium D CPU, some a Core 2 Duo. Some motherboards are based on the Intel 945 chipset, others on the Intel 965 chipset. We are testing these machines now and sometimes replace some of the hardware.
Comment 16 Dmitry Sanarin 2007-12-17 09:32:49 UTC
It seems the error only occurs when a network card is in slot 06:0e! I have removed the card from this slot, and so far I have no errors or problems. I will leave the machines running for some time...
Comment 17 David Dillow 2007-12-18 20:22:41 UTC
I'm glad to hear that avoiding that slot seems to also avoid the issue. Please let me know how your extended testing goes.

If you get a chance, could you please attach the output of 'lspci -xxx' to this bug? I'd like to see that with one of the typhoon cards in slot 06:0e -- I'm interested in looking at the bridge setup to see if anything sticks out.

Also, is 06:0e the farthest slot from the CPU/south bridge by any chance?

In any event, I can see there are a few things the driver could do to be a bit more robust to hardware failure -- like avoiding an oops! -- so I'll try to get those done soon.

Thanks for testing!
Comment 18 David Dillow 2007-12-18 20:33:20 UTC
While I'm thinking about it -- lspci -xxx output for the card both in slot 06:0e and not in 06:0e would be useful for comparison purposes.
Comment 19 Dmitry Sanarin 2007-12-24 03:17:39 UTC
Created attachment 14167 [details]
lspci -xxx output

This is the working configuration. The machine ran for 6 days without any problems.
Comment 20 Dmitry Sanarin 2007-12-24 03:53:49 UTC
Created attachment 14168 [details]
lspci -xxx output

 Got "no descs for cmd" using this setup.
Comment 21 Dmitry Sanarin 2007-12-24 05:44:39 UTC
It looks like the card in slot 06:0e is not able to send or receive any data...
Comment 22 David Dillow 2007-12-30 21:29:01 UTC
Thank you for collecting the lspci output. Unfortunately, nothing jumps out at me in it. :(

Can I ask you to do a few more tests for me, please?

Put the typhoon in 06:0e, and have it be the only add-on card in the system. If you need another card for the disks, please make sure it is on a different PCI bus (ie, move the ASUS 8211 RAID controller to a different bus if it must be installed). Does the typhoon work in this configuration? If not, can you try a different card (non-typhoon) and test that? I want to check if that slot will work for anything, or if it is just plain unusable.

If the slot is workable, please put everything back in, but swap the ASUS 8211 and the typhoon in 06:0e and see if they keep working? You may want to skip this test if you want to keep the data on the disks attached to the 8211 -- this could corrupt things.

I'll keep thinking about this, perhaps I can post a patch to dump the interface variables to dmesg so we can see if they are getting corrupted or somesuch...
Comment 23 David Dillow 2008-05-21 13:39:50 UTC
This bug has been inactive for nearly 5 months. Dmitry, can you give us an update on any further testing in that slot?
Comment 24 Lazar Krumov 2014-04-02 22:02:11 UTC
Recently I experienced almost the same problem with the same NICs:
Apr  3 03:05:42 pl1 kernel: [ 2422.179135] typhoon 0000:03:0d.0: eth3: no descs for cmd, had (needed) 0 (1) cmd, 31 (7) resp
Apr  3 03:05:42 pl1 kernel: [ 2422.179909] typhoon 0000:03:0c.0: eth9: no descs for cmd, had (needed) 0 (1) cmd, 31 (7) resp
Apr  3 03:05:42 pl1 kernel: [ 2422.180724] typhoon 0000:03:0b.0: eth4: no descs for cmd, had (needed) 0 (1) cmd, 31 (7) resp

My problem was that none of the three cards worked when I connected the fiber cable. The same cable connected to another machine (with a different fiber card - 1Gb/s) works.

Then I found this post and started reshuffling the cards in the machine. It is an ASUS P4C800-E (with AGP/Pro). Unfortunately the ASUS site no longer has the spec page, but the manual for the motherboard is there:
http://dlcdnet.asus.com/pub/ASUS/mb/sock478/P4C800E-DX/e1347b_p4c800-e_deluxe.pdf

*****
When I bring the interface UP, the kernel stops reporting the problem, but the FIBER LAN cards do not work.
*****

Then came many card reshuffles, tests, reboots ... I burnt and replaced the video card, exhausted and replaced the CMOS battery, and found the funny thing:
 -- my problem was that the switch port was hard-set to 1Gb/s (the 3CR990 cards are 100Mb/s) :)

Anyway the KERNEL keeps reporting this:
... no descs for cmd, had (needed) 0 (1) cmd, 31 (7) resp
until I run:
> ifconfig ethX up
for each and every typhoon interface.

The machine is not in real use yet. I'm still setting it up, but I have not had any KERNEL PANICs since I built the machine on 13.Mar.2014 (3 weeks so far).

I'd be glad if I can help in solving the nasty message from typhoon module.
I can provide any/all details on the machine [but not user/pass ;) ]

/proc/version:
Linux version 3.2.0-4-686-pae (debian-kernel@lists.debian.org) (gcc version 4.6.3 (Debian 4.6.3-14) ) #1 SMP Debian 3.2.54-2
MB: ASUS P4C800-E Deluxe
CPU: Intel(R) Pentium(R) 4 CPU 2.60GHz
RAM: 2 x 256MB DDR-266 (DUAL Channel)
Extension Cards:
SLOT  -------   CARD   (slot# is according to the motherboard manual)
AGP-Pro:  VGA compatible controller: NVIDIA Corporation NV11 [GeForce2 MX/MX 400] (rev a1) (the slot is AGP Pro, but the card AGPx4)
onboard:  Ethernet controller: Intel Corporation 82547EI Gigabit Ethernet Controller
PCI1:     Ethernet controller: Intel Corporation 82546EB Gigabit Ethernet Controller (Copper) (rev 01) ---> This is DUAL-UTP-ports
PCI2:     Ethernet controller: D-Link System Inc DL10050 Sundance Ethernet (rev 15) ----> QUAD UTP Ports
PCI3:     Ethernet controller: 3Com Corporation 3CR990-FX-95/97/95 [Typhon Fiber] (rev 02)
PCI4:     Ethernet controller: 3Com Corporation 3CR990-FX-95/97/95 [Typhon Fiber] (rev 02)
PCI5:     Ethernet controller: 3Com Corporation 3CR990-FX-95/97/95 [Typhon Fiber] (rev 02)


10x !

cheers,
LAZA
Comment 25 David Dillow 2014-04-13 01:56:59 UTC
Are you comfortable building a custom kernel?

If so, please edit the message in typhoon_issue_command() (drivers/net/ethernet/3com/typhoon.c) to print the command that we're attempting to issue. Alternatively, add another netdev_err() line below it in the if branch:
  netdev_err(tp->dev, "command %04x\n", cmd->cmd);

Sending me a full dmesg with the output should give us some insight.

Thanks!
Comment 26 Lazar Krumov 2014-04-23 23:07:58 UTC
Created attachment 133471 [details]
dmesg output with the debug message
Comment 27 Lazar Krumov 2014-04-23 23:09:08 UTC
Created attachment 133481 [details]
fresh lshw output at the moment with the debug message
Comment 28 Lazar Krumov 2014-04-23 23:09:40 UTC
Created attachment 133491 [details]
fresh lspci -v  output at the moment with the debug message
Comment 29 Lazar Krumov 2014-04-23 23:17:52 UTC
Sorry for the delay - "too many opportunities" and some holidays :)

I compiled the new kernel. I took the source from the DEBIAN repository - latest stable wheezy - package linux-source-3.2 (3.2.54-2).
I did not replace the whole kernel and all modules, only the typhoon.ko file. The running kernel is the same version as the compiled one.

The actual line put by me was:
netdev_err(tp->dev, "DEBUG: command %04x\n", cmd->cmd);

I created a new INIT RAMDISK, since apparently this module is loaded from the ramdisk.

I attached the output from dmesg, lspci -v , lshw

As you can see the output is:
typhoon 0000:03:0d.0: eth3: DEBUG: command 0007
typhoon 0000:03:0d.0: eth3: no descs for cmd, had (needed) 0 (1) cmd, 31 (7) resp
typhoon 0000:03:0c.0: eth9: DEBUG: command 0007
typhoon 0000:03:0c.0: eth9: no descs for cmd, had (needed) 0 (1) cmd, 31 (7) resp
typhoon 0000:03:0b.0: eth4: DEBUG: command 0007
typhoon 0000:03:0b.0: eth4: no descs for cmd, had (needed) 0 (1) cmd, 31 (7) resp

If you need any further info or some tests (for example with a stock kernel from kernel.org) - I can still play with the machine. It is not in production yet.

10x
LAZA
Comment 30 David Dillow 2014-04-24 03:28:16 UTC
Ah, I think I see the problem. In your modified typhoon.c, please see typhoon_get_settings() -- we want to make the call to typhoon_do_get_stats() dependent on the state of the card, like we do in typhoon_get_stats().

It should look something like:
smp_rmb();
if (tp->card_state == Running) {
  typhoon_do_get_stats(tp);
  ethtool_cmd_speed_set(cmd, tp->speed);
} else {
  tp->speed = SPEED_UNKNOWN;
}

I suspect this will fix the problem you see. Please let me know how it works, and if we're successful, then I will prepare a proper patch for the mainline kernel.
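
To illustrate the idea outside the kernel, here is a minimal user-space sketch of the same guard. Everything in it is a hypothetical stand-in (the `model_*` names, the fixed speed of 100, the `MODEL_SPEED_UNKNOWN` value); it is not the driver code and it ignores the memory-ordering concern that smp_rmb() addresses. It only shows that a query which checks the card state first never issues a command to a card that is not running.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space model; none of these names are the driver's. */
enum model_state { MODEL_SLEEPING, MODEL_RUNNING };

#define MODEL_SPEED_UNKNOWN (-1)

struct model_nic {
    enum model_state state;
    int speed;        /* last reported link speed in Mb/s */
    int failed_cmds;  /* commands issued while the card could not take them */
};

/* Stands in for a stats command: it only succeeds on a running card. */
static void model_do_get_stats(struct model_nic *nic)
{
    if (nic->state != MODEL_RUNNING) {
        nic->failed_cmds++;   /* this is where the log line would appear */
        return;
    }
    nic->speed = 100;
}

/* Unguarded query: always issues the command, even on a down interface. */
static int model_get_speed_unguarded(struct model_nic *nic)
{
    assert(nic != NULL);
    model_do_get_stats(nic);
    return nic->speed;
}

/* Guarded query: only talks to the card when it is running. */
static int model_get_speed_guarded(struct model_nic *nic)
{
    assert(nic != NULL);
    if (nic->state == MODEL_RUNNING) {
        model_do_get_stats(nic);
    } else {
        nic->speed = MODEL_SPEED_UNKNOWN;
    }
    return nic->speed;
}
```

In this model, calling the unguarded query on a sleeping card bumps `failed_cmds` every time, while the guarded query leaves it untouched and reports an unknown speed instead.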

Thanks for getting me the info!
Comment 31 Lazar Krumov 2014-04-24 21:03:31 UTC
CONFIRMED :)

The proposed if-then-else check fixed the problem with flooding the syslog with the unwanted messages. I've tested it all over again and with this check everything works fine - negotiation, re-negotiation ... and most important - no more messages in the syslog :)

It would be nice to see this patch in future kernel releases.

I believe the initial BUG can be closed. The cause of this message is not something that affects the overall system stability. 
Most probably the KERNEL DUMPs reported initially by Dmitry Sanarin come from another problem with either the machines or the Fedora kernel.

Thank you for the support !

Cheers,
LAZA
