Bug 9225
Summary: | (net typhoon) "no descs for cmd, had (needed) 0 (1) cmd, 31 (7) resp" | ||
---|---|---|---|
Product: | Drivers | Reporter: | Dmitry Sanarin (sanarin) |
Component: | Network | Assignee: | Jeff Garzik (jgarzik) |
Status: | REJECTED INSUFFICIENT_DATA | ||
Severity: | high | CC: | dave, lkrumov_subs |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 2.6.23.1 SMP PREEMPT | Subsystem: | |
Regression: | --- | Bisected commit-id: | |
Attachments: |
Kernel panic photo
full dmesg output
full dmesg output then cards swapped
Our machines photo
lspci -xxx otput
lspci -xxx output
dmesg output with the debug message
fresh lshw output at the moment with the debug message
fresh lspci -v output at the moment with the debug message |
Description
Dmitry Sanarin
2007-10-25 06:45:41 UTC
Created attachment 13279 [details]
Kernel panic photo
lspci
00:00.0 Host bridge: Intel Corporation 945G/GZ/P/PL Express Memory Controller Hub (rev 02)
00:01.0 PCI bridge: Intel Corporation 945G/GZ/P/PL Express PCI Express Root Port (rev 02)
00:1c.0 PCI bridge: Intel Corporation 82801G (ICH7 Family) PCI Express Port 1 (rev 01)
00:1c.1 PCI bridge: Intel Corporation 82801G (ICH7 Family) PCI Express Port 2 (rev 01)
00:1c.2 PCI bridge: Intel Corporation 82801G (ICH7 Family) PCI Express Port 3 (rev 01)
00:1c.3 PCI bridge: Intel Corporation 82801G (ICH7 Family) PCI Express Port 4 (rev 01)
00:1d.0 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI #1 (rev 01)
00:1d.1 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI #2 (rev 01)
00:1d.2 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI #3 (rev 01)
00:1d.3 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI #4 (rev 01)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev e1)
00:1f.0 ISA bridge: Intel Corporation 82801GB/GR (ICH7 Family) LPC Interface Bridge (rev 01)
00:1f.2 IDE interface: Intel Corporation 82801GB/GR/GH (ICH7 Family) Serial ATA Storage Controller IDE (rev 01)
00:1f.3 SMBus: Intel Corporation 82801G (ICH7 Family) SMBus Controller (rev 01)
01:00.0 VGA compatible controller: nVidia Corporation NV44 [Quadro NVS 285] (rev a1)
03:00.0 PCI bridge: PLX Technology, Inc. PEX 8111 PCI Express-to-PCI Bridge (rev 21)
05:00.0 VGA compatible controller: nVidia Corporation NV44 [Quadro NVS 285] (rev a1)
06:00.0 VGA compatible controller: nVidia Corporation NV44 [Quadro NVS 285] (rev a1)
07:0c.0 Ethernet controller: 3Com Corporation 3C990B-TX-M/3C990BSVR [Typhoon2] (rev 03)
07:0d.0 Ethernet controller: 3Com Corporation 3C990B-TX-M/3C990BSVR [Typhoon2] (rev 03)
07:0e.0 Ethernet controller: 3Com Corporation 3C990B-TX-M/3C990BSVR [Typhoon2] (rev 03)
07:0f.0 Serial controller: Moxa Technologies Co Ltd Intellio C218 Turbo PCI (rev 02)

Reply-To: akpm@linux-foundation.org

On Thu, 25 Oct 2007 06:45:42 -0700 (PDT) bugme-daemon@bugzilla.kernel.org wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=9225
>
> Summary: typhoon : "no descs for cmd, had (needed) 0 (1) cmd, 31 (7) resp"
> Product: Drivers
> Version: 2.5
> KernelVersion: 2.6.23.1 SMP PREEMPT
> Platform: All
> OS/Version: Linux
> Tree: Mainline
> Status: NEW
> Severity: high
> Priority: P1
> Component: Network
> AssignedTo: jgarzik@pobox.com
> ReportedBy: sanarin@nikiet.ru
>
>
> Most recent kernel where this bug did not occur: none
> Distribution: Fedora Core 4
> Hardware Environment: Intel 845 chipset, 3 x 3C990-FX-97 NICs
> Software Environment:
> Problem Description:
> Our dmesg buffer is full of these messages
>
> eth1: no descs for cmd, had (needed) 0 (1) cmd, 31 (7) resp
> eth1: error getting stats
> eth1: no descs for cmd, had (needed) 0 (1) cmd, 31 (7) resp
> eth1: error getting stats
> eth1: no descs for cmd, had (needed) 0 (1) cmd, 31 (7) resp
> eth1: error getting stats
>
> and so on.
>
> After days of running we got a kernel panic.
>
> Steps to reproduce:

(switching to email - please respond via reply-to-all, not via the bugzilla web interface)

typhoon.c doesn't appear to have a maintainer.

It oopsed in typhoon_rx - there's a jpg attached to the bugzilla report.

I am the maintainer, and I'm listed in the file, but this is the first time I've heard of this.
I must have missed it coming through netdev. How were you generating the "no descs" error? Did you have something banging away at it trying to get stats? I ask as I don't recall that the driver uses any command descriptors once it is up, unless it is retrieving stats. But it has been a while since I've been in the source code. As for the oops, I'll have to think about how that's happening. Also, please attach a full dmesg from a boot. It'd be nice if it showed the error messages, but that is not necessary.

Couple of other questions -- was it only doing this on eth1? You have two other typhoons, so I wonder about a bad/stuck card. Doesn't explain the oops, though. Please send me/attach the typhoon.ko you were using so that I can decode where it is dying in the function. What were you doing when it oopsed? Any VLANs in the configuration?

Nevermind the typhoon.ko -- it looks like it's oopsing at skb_copy_to_linear_data() in typhoon_rx(), but I'm not sure why skb would be NULL... Will keep looking...

Our cards are not bad/stuck. We have 60 of those cards in 20 machines and all the hardware works badly. I'll send the full dmesg output and other details tomorrow.

So, they continue to send packets while this is happening? When sending the dmesg, please make sure to capture from boot until the first time this happens. Thank you for following up.

Created attachment 13974 [details]
full dmesg output
The cards work fine while we get the errors. They receive and send at good speed. But after a few days of running like this, we usually get a kernel panic.

The only one having an error in the dmesg is eth2 -- and it looks like it is having a problem from the get-go. Is it still passing traffic at the end of this? That's a separate data path from the commands, but it would still be useful to double check.

You mentioned that you have 20 machines doing this -- are they identical? Is it always the card in the same physical slot that gives the issue -- slot 06:0e? I'm not sure how that maps to the physical slot. Do the other typhoon cards in this machine eventually show the same problem, or are they good until you get a panic? Can you swap the location of two cards in this same machine and post a dmesg of the problem happening?

I'm still leaning towards some hardware interaction that I'd like to understand, or rule out, but I'm concerned about the oops. Now that I have part of my test network back together, I'm seeing an occasional issue when shutting down the card on my desktop machine, but not every time.

Also, it looks like you are using Fedora 8 with the SMP kernel. If I send you a typhoon.ko with extra debugging, will you be able to run it for me? Or will it be better to send source and have you compile it?

Thank you for sticking with this; I'd like to run it down.

Created attachment 14083 [details]
full dmesg output then cards swapped
I have swapped cards.
I can test your typhoon.ko with extra debugging. You can send me the source or the typhoon.ko module.

Created attachment 14084 [details]
Our machines photo
Our machines are almost identical. These are industrial PICMG 1.3 machines. They consist of a motherboard and a backplane.
Some of them have a Pentium D CPU, some a Core 2 Duo. Some motherboards are based on the Intel 945 chipset, others on the Intel 965 chipset. We are testing these machines now and sometimes replace some hardware.
It seems the error only occurs when a network card is in slot 06:0e! I have removed the card from this slot and so far I have no errors or problems. I will leave the machines running for some time...

I'm glad to hear that avoiding that slot seems to also avoid the issue. Please let me know how your extended testing goes.

If you get a chance, could you please attach the output of 'lspci -xxx' to this bug? I'd like to see that with one of the typhoon cards in slot 06:0e -- I'm interested in looking at the bridge setup to see if anything sticks out. Also, is 06:0e the farthest slot from the CPU/south bridge by any chance?

In any event, I can see there are a few things the driver could do to be a bit more robust to hardware failure -- like avoiding an oops! -- so I'll try to get those done soon. Thanks for testing!

While I'm thinking about it -- lspci -xxx output for the card both in slot 06:0e and not in 06:0e would be useful for comparison purposes.

Created attachment 14167 [details]
lspci -xxx otput
This is the working configuration. The machine ran for 6 days without any problem.
Created attachment 14168 [details]
lspci -xxx output
Got "no descs for cmd" using this setup.
It looks like the card in slot 06:0e is not able to send/receive any data...

Thank you for collecting the lspci output. Unfortunately, nothing jumps out at me in it. :(

Can I ask you to do a few more tests for me, please?

Put the typhoon in 06:0e, and have it be the only add-on card in the system. If you need another card for the disks, please make sure it is on a different PCI bus (i.e., move the ASUS 8211 RAID controller to a different bus if it must be installed). Does the typhoon work in this configuration?

If not, can you try a different card (non-typhoon) and test that? I want to check if that slot will work for anything, or if it is just plain unusable.

If the slot is workable, please put everything back in, but swap the ASUS 8211 and the typhoon in 06:0e and see if they keep working. You may want to skip this test if you want to keep the data on the disks attached to the 8211 -- this could corrupt things.

I'll keep thinking about this; perhaps I can post a patch to dump the interface variables to dmesg so we can see if they are getting corrupted or somesuch...

This bug has been inactive for nearly 5 months. Dmitry, can you give us an update on any further testing in that slot?

Recently I experienced almost the same problem with the same NICs:

Apr 3 03:05:42 pl1 kernel: [ 2422.179135] typhoon 0000:03:0d.0: eth3: no descs for cmd, had (needed) 0 (1) cmd, 31 (7) resp
Apr 3 03:05:42 pl1 kernel: [ 2422.179909] typhoon 0000:03:0c.0: eth9: no descs for cmd, had (needed) 0 (1) cmd, 31 (7) resp
Apr 3 03:05:42 pl1 kernel: [ 2422.180724] typhoon 0000:03:0b.0: eth4: no descs for cmd, had (needed) 0 (1) cmd, 31 (7) resp

My problem was that none of the three cards worked when I connected the fiber cable. The same cable connected to another machine (with a different fiber card - 1Gb/s) works. Then I found this post and started reshuffling the cards in the machine. It is an ASUS P4C800-E (with AGP/Pro). Unfortunately the ASUS site no longer has the spec page, but the manual for the motherboard is there:
http://dlcdnet.asus.com/pub/ASUS/mb/sock478/P4C800E-DX/e1347b_p4c800-e_deluxe.pdf

***** When I bring the interface UP, the kernel stops reporting the problem, but the FIBER LAN cards do not work. *****

Many card reshuffles, tests and reboots followed ... I burnt and replaced the video card, exhausted and replaced the CMOS battery, and found the funny thing:
-- my problem was that the switch port was hard-set to 1Gb/s (the 3CR990 cards are 100Mb/s) :)

Anyway, the KERNEL keeps reporting this:
... no descs for cmd, had (needed) 0 (1) cmd, 31 (7) resp
until I run:
> ifconfig ethX up
for each and every typhoon interface.

I do not use the machine in production. I'm still setting it up, but I have not had any KERNEL PANICs since I built the machine on 13.Mar.2014 (3 weeks so far).

I'd be glad to help in solving the nasty message from the typhoon module. I can provide any/all details on the machine [but not user/pass ;) ]

/proc/version: Linux version 3.2.0-4-686-pae (debian-kernel@lists.debian.org) (gcc version 4.6.3 (Debian 4.6.3-14) ) #1 SMP Debian 3.2.54-2
MB: ASUS P4C800-E-Delux
CPU: Intel(R) Pentium(R) 4 CPU 2.60GHz
RAM: 2 x 256MB DDR-266 (DUAL Channel)
Extension Cards:
SLOT ------- CARD (slot# is according to the motherboard manual)
AGP-Pro: VGA compatible controller: NVIDIA Corporation NV11 [GeForce2 MX/MX 400] (rev a1) (the slot is AGP Pro, but the card is AGPx4)
onboard: Ethernet controller: Intel Corporation 82547EI Gigabit Ethernet Controller
PCI1: Ethernet controller: Intel Corporation 82546EB Gigabit Ethernet Controller (Copper) (rev 01) ---> This is DUAL-UTP-ports
PCI2: Ethernet controller: D-Link System Inc DL10050 Sundance Ethernet (rev 15) ----> QUAD UTP Ports
PCI3: Ethernet controller: 3Com Corporation 3CR990-FX-95/97/95 [Typhon Fiber] (rev 02)
PCI4: Ethernet controller: 3Com Corporation 3CR990-FX-95/97/95 [Typhon Fiber] (rev 02)
PCI5: Ethernet controller: 3Com Corporation 3CR990-FX-95/97/95 [Typhon Fiber] (rev 02)

10x !
cheers,
LAZA

Are you comfortable building a custom kernel? If so, please edit the message in typhoon_issue_command() (drivers/net/ethernet/3com/typhoon.c) to print the command that we're attempting to issue. Alternatively, add another netdev_err() line below it in the if branch:

    netdev_err(tp->dev, "command %04x\n", cmd->cmd);

Sending me a full dmesg with the output should give us some insight.

Thanks!

Created attachment 133471 [details]
dmesg output with the debug message
Created attachment 133481 [details]
fresh lshw output at the moment with the debug message
Created attachment 133491 [details]
fresh lspci -v output at the moment with the debug message
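For readers following along, here is a minimal user-space sketch of what the "no descs for cmd" message and the requested debug line report. It is not the driver's code: `struct ring`, `ring_num_free()`, `issue_command()` and the ring size of 32 are hypothetical names and values chosen for illustration, and the free-slot formula is the usual "keep one slot empty" ring convention, assumed here rather than taken from typhoon.c.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical user-space model; names and sizes are assumptions,
 * not the definitions used in drivers/net/ethernet/3com/typhoon.c. */
struct ring {
    int size;        /* number of descriptors in the ring */
    int last_write;  /* index the host will write next */
    int last_read;   /* index the card has consumed up to */
};

/* Free descriptors, keeping one slot unused so that a full ring can be
 * told apart from an empty one. */
static int ring_num_free(const struct ring *r)
{
    assert(r->size > 0);
    return (r->size + r->last_read - r->last_write - 1) % r->size;
}

/* Try to queue a command needing num_cmd command descriptors and
 * num_resp response descriptors. On failure, format the error plus the
 * extra "command" debug line into log and return -1. On success,
 * advance the command ring, clear log and return 0. */
static int issue_command(struct ring *cmd, const struct ring *resp,
                         unsigned int code, int num_cmd, int num_resp,
                         char *log, size_t log_len)
{
    int free_cmd = ring_num_free(cmd);
    int free_resp = ring_num_free(resp);

    if (free_cmd < num_cmd || free_resp < num_resp) {
        snprintf(log, log_len,
                 "no descs for cmd, had (needed) %d (%d) cmd, %d (%d) resp\n"
                 "command %04x\n",
                 free_cmd, num_cmd, free_resp, num_resp, code);
        return -1;
    }

    cmd->last_write = (cmd->last_write + num_cmd) % cmd->size;
    if (log_len > 0)
        log[0] = '\0';
    return 0;
}
```

Read this way, "had (needed) 0 (1) cmd, 31 (7) resp" says the command needed one command descriptor and found none free, while the response ring had 31 free against 7 needed. The extra line only adds which command was being attempted.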
Sorry for the delay - "too many opportunities" and some holidays :)

I compiled the new kernel. I took the source from the DEBIAN repository - last stable wheezy - package linux-source-3.2 (3.2.54-2). I did not replace the whole kernel and all modules, only the typhoon.ko file. The running kernel is the same version as the compiled one.

The actual line I added was:

    netdev_err(tp->dev, "DEBUG: command %04x\n", cmd->cmd);

I created a new INIT RAMDISK, since apparently this module is loaded from the ramdisk.

I attached the output from dmesg, lspci -v and lshw.

As you can see, the output is:

typhoon 0000:03:0d.0: eth3: DEBUG: command 0007
typhoon 0000:03:0d.0: eth3: no descs for cmd, had (needed) 0 (1) cmd, 31 (7) resp
typhoon 0000:03:0c.0: eth9: DEBUG: command 0007
typhoon 0000:03:0c.0: eth9: no descs for cmd, had (needed) 0 (1) cmd, 31 (7) resp
typhoon 0000:03:0b.0: eth4: DEBUG: command 0007
typhoon 0000:03:0b.0: eth4: no descs for cmd, had (needed) 0 (1) cmd, 31 (7) resp

If you need any further info or some test (for example with a genuine kernel from kernel.org) - I can still play with the machine. It is not in production yet.

10x
LAZA

Ah, I think I see the problem. In your modified typhoon.c, please see typhoon_get_settings() -- we want to make the call to typhoon_do_get_stats() dependent on the state of the card, like we do in typhoon_get_stats(). It should look something like:

    smp_rmb();
    if (tp->card_state == Running) {
        typhoon_do_get_stats(tp);
        ethtool_cmd_speed_set(cmd, tp->speed);
    } else {
        tp->speed = SPEED_UNKNOWN;
    }

I suspect this will fix the problem you see. Please let me know how it works, and if we're successful, then I will prepare a proper patch for the mainline kernel.

Thanks for getting me the info!

CONFIRMED :)

The proposed if-then-else check fixed the problem of the syslog being flooded with the unwanted messages. I've tested it all over again and with this check everything works fine - negotiation, re-negotiation ... and most important - no more messages in the syslog :)

It would be nice to see this patch in future kernel releases.

I believe the initial BUG can be closed. The cause of this message is not something that affects the overall system stability. Most probably the KERNEL DUMPs reported initially by Dmitry Sanarin come from another problem, with either the machines or the Fedora kernel.

Thank you for the support !

Cheers,
LAZA |
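As a rough illustration of why the state check stops the flood, here is a small user-space model of the guard proposed above. Everything in it is an assumption made for the sketch: `struct nic`, `enum card_state`, `SKETCH_SPEED_UNKNOWN`, `do_get_stats()` and `get_settings_speed()` are hypothetical stand-ins, not the driver's types or functions, and the memory barrier from the suggested snippet is left out because the model is single-threaded.

```c
#include <assert.h>

/* Hypothetical stand-ins for the driver's state; not taken from typhoon.c. */
enum card_state { CARD_SLEEPING, CARD_RUNNING };

#define SKETCH_SPEED_UNKNOWN (-1)  /* placeholder for an "unknown" speed */

struct nic {
    enum card_state state;
    int speed;                  /* last known link speed in Mb/s */
    int stats_commands_issued;  /* counts commands sent to the card */
};

/* Stand-in for the stats query: in this model it always costs one
 * command and reports a 100 Mb/s link. */
static void do_get_stats(struct nic *n)
{
    n->stats_commands_issued++;
    n->speed = 100;
}

/* Only talk to the card when it is running; otherwise report the speed
 * as unknown without issuing any command. */
static int get_settings_speed(struct nic *n)
{
    if (n->state == CARD_RUNNING) {
        do_get_stats(n);
        return n->speed;
    }
    n->speed = SKETCH_SPEED_UNKNOWN;
    return n->speed;
}
```

In this model, querying settings on an interface that is down returns the unknown speed and issues no command, which matches the observation earlier in the thread that the messages stopped once each interface was brought up.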