Bug 20092 - Sky2 module causes network crash when traffic is high
Summary: Sky2 module causes network crash when traffic is high
Status: RESOLVED UNREPRODUCIBLE
Alias: None
Product: Networking
Classification: Unclassified
Component: Other
Hardware: All Linux
Importance: P1 high
Assignee: Arnaldo Carvalho de Melo
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-10-11 16:44 UTC by Thorsten Roth
Modified: 2012-08-27 17:03 UTC
CC List: 3 users

See Also:
Kernel Version: 2.6.35
Subsystem:
Regression: No
Bisected commit-id:


Attachments
kernel log (3.62 KB, application/octet-stream)
2010-10-11 16:44 UTC, Thorsten Roth
lspci output (23.75 KB, text/plain)
2010-10-11 16:45 UTC, Thorsten Roth

Description Thorsten Roth 2010-10-11 16:44:22 UTC
Created attachment 33262
kernel log

This problem seems to have persisted across quite a few kernel versions now, appearing in one form or another. When sending large amounts of data over the network, the kernel module crashes after some time, and it does not seem to be possible to bring it back up. Most of the time this happens after more than 10 GiB of data have been transmitted, but it does not seem to be deterministic.

This happens only with the sky2 module. On older kernel versions it was possible to replace sky2 with the sk98lin module, which worked perfectly, but sk98lin is not yet available for 2.6.35.
Comment 1 Thorsten Roth 2010-10-11 16:45:01 UTC
Created attachment 33272
lspci output
Comment 2 Patrick Hemmer 2010-12-18 20:29:56 UTC
I have the exact same problem with kernel 2.6.32 on RHEL6.

------------[ cut here ]------------
WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x26d/0x280() (Not tainted)
Hardware name: ProLiant DL360 G5
NETDEV WATCHDOG: eth0 (bnx2): transmit queue 0 timed out
Modules linked in: nfs fscache nfsd lockd nfs_acl auth_rpcgss exportfs raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx autofs4 sunrpc p4_clockmod freq_table speedstep_lib ipv6 dm_mirror dm_region_hash dm_log aoe hpilo ipmi_si ipmi_msghandler bnx2 sg serio_raw iTCO_wdt iTCO_vendor_support i5k_amb hwmon i5000_edac edac_core shpchp ext4 mbcache jbd2 sr_mod cdrom ata_generic pata_acpi ata_piix hpsa cciss radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core dm_mod [last unloaded: freq_table]
Pid: 0, comm: swapper Not tainted 2.6.32-71.el6.x86_64 #1
Call Trace:
 <IRQ>  [<ffffffff8106b857>] warn_slowpath_common+0x87/0xc0
 [<ffffffff8106b946>] warn_slowpath_fmt+0x46/0x50
 [<ffffffff8142928d>] dev_watchdog+0x26d/0x280
 [<ffffffff81097de5>] ? sched_clock_local+0x25/0x90
 [<ffffffff81429020>] ? dev_watchdog+0x0/0x280
 [<ffffffff8107de07>] run_timer_softirq+0x197/0x340
 [<ffffffff810a0c90>] ? tick_sched_timer+0x0/0xc0
 [<ffffffff8102f52d>] ? lapic_next_event+0x1d/0x30
 [<ffffffff81073bd7>] __do_softirq+0xb7/0x1e0
 [<ffffffff81095a50>] ? hrtimer_interrupt+0x140/0x250
 [<ffffffff810142cc>] call_softirq+0x1c/0x30
 [<ffffffff81015f35>] do_softirq+0x65/0xa0
 [<ffffffff810739d5>] irq_exit+0x85/0x90
 [<ffffffff814cfa01>] smp_apic_timer_interrupt+0x71/0x9c
 [<ffffffff81013c93>] apic_timer_interrupt+0x13/0x20
 <EOI>  [<ffffffff8101bc01>] ? mwait_idle+0x71/0xd0
 [<ffffffff814cd80a>] ? atomic_notifier_call_chain+0x1a/0x20
 [<ffffffff81011e96>] cpu_idle+0xb6/0x110
 [<ffffffff814b09fa>] rest_init+0x7a/0x80
 [<ffffffff818c1ecd>] start_kernel+0x413/0x41f
 [<ffffffff818c133a>] x86_64_start_reservations+0x125/0x129
 [<ffffffff818c1438>] x86_64_start_kernel+0xfa/0x109
---[ end trace a22b8de6237e0ed6 ]---
bnx2: eth0 DEBUG: intr_sem[0]
bnx2: eth0 DEBUG: EMAC_TX_STATUS[00000008] RPM_MGMT_PKT_CTRL[00000000]
bnx2: eth0 DEBUG: MCP_STATE_P0[00000000] MCP_STATE_P1[00000000]
bnx2: eth0 DEBUG: HC_STATS_INTERRUPT_STATUS[00000000]
bnx2: eth0 NIC Copper Link is Down
Comment 3 Patrick Hemmer 2010-12-18 20:37:59 UTC
Oh, and my problem was obviously with bnx2, not sky2, but the error occurred at the same place.
Comment 4 Robin Rainton 2011-03-28 09:15:13 UTC
I can confirm this is happening with FC14 2.6.35.11-83.fc14.i686.PAE

06:00.0 Ethernet controller: Marvell Technology Group Ltd. 88E8053 PCI-E Gigabit Ethernet Controller (rev 22)

This is the on-board controller on a DFI Lanparty Jr X58 T3H6. It has been working fine without load, but I recently started using drives mounted from a server via NFS (all Gb LAN), and the crash now happens frequently :(

I had read that this might be to do with jumbo frames or bridging, but this is the only NIC on this system (eth0) and the MTU is 1500.

Output from /var/log/messages:

--- snip ---
Mar 28 19:31:11 hsem kernel: [ 3965.219102] sky2 0000:06:00.0: error interrupt status=0x80000000
Mar 28 19:31:11 hsem kernel: [ 3965.219111] sky2 0000:06:00.0: PCI hardware error (0x2010)
Mar 28 19:31:23 hsem kernel: [ 3977.281589] ------------[ cut here ]------------
Mar 28 19:31:23 hsem kernel: [ 3977.281600] WARNING: at net/sched/sch_generic.c:258 dev_watchdog+0xc6/0x12e()
Mar 28 19:31:23 hsem kernel: [ 3977.281604] Hardware name: System Product
Mar 28 19:31:23 hsem kernel: [ 3977.281607] NETDEV WATCHDOG: eth0 (sky2): transmit queue 0 timed out
Mar 28 19:31:23 hsem kernel: [ 3977.281609] Modules linked in: tcp_lp vfat fat fuse nfsd exportfs hwmon_vid coretemp nfs lockd fscache nfs_acl auth_rpcgss sunrpc cpufreq_ondemand acpi_cpufreq mperf ipv6 uinput rc_dib0700_rc5 nvidia(P) snd_hda_codec_realtek dvb_usb_dib0700 snd_hda_intel dib7000p dib0090 dib7000m dib0070 dvb_usb snd_hda_codec dib8000 snd_hwdep dib9000 ir_lirc_codec lirc_dev ir_sony_decoder dvb_core ir_jvc_decoder ir_rc6_decoder snd_seq ir_rc5_decoder ir_nec_decoder dib3000mc snd_seq_device rc_core dibx000_common snd_pcm joydev i2c_i801 microcode iTCO_wdt i7core_edac snd_timer i2c_core edac_core wmi serio_raw iTCO_vendor_support snd sky2 soundcore snd_page_alloc pata_acpi ata_generic usb_storage pata_jmicron [last unloaded: scsi_wait_scan]
Mar 28 19:31:23 hsem kernel: [ 3977.281671] Pid: 0, comm: swapper Tainted: P            2.6.35.11-83.fc14.i686.PAE #1
Mar 28 19:31:23 hsem kernel: [ 3977.281674] Call Trace:
Mar 28 19:31:23 hsem kernel: [ 3977.281681]  [<c043fd21>] warn_slowpath_common+0x6a/0x7f
Mar 28 19:31:23 hsem kernel: [ 3977.281686]  [<c073bc23>] ? dev_watchdog+0xc6/0x12e
Mar 28 19:31:23 hsem kernel: [ 3977.281690]  [<c043fda9>] warn_slowpath_fmt+0x2b/0x2f
Mar 28 19:31:23 hsem kernel: [ 3977.281695]  [<c073bc23>] dev_watchdog+0xc6/0x12e
Mar 28 19:31:23 hsem kernel: [ 3977.281702]  [<c0451b72>] ? insert_work+0x6e/0x77
Mar 28 19:31:23 hsem kernel: [ 3977.281705]  [<c07b9d9d>] ? _raw_spin_unlock_irqrestore+0x13/0x15
Mar 28 19:31:23 hsem kernel: [ 3977.281707]  [<c0451d5f>] ? __queue_work+0x2f/0x34
Mar 28 19:31:23 hsem kernel: [ 3977.281711]  [<c044a698>] run_timer_softirq+0x167/0x20e
Mar 28 19:31:23 hsem kernel: [ 3977.281713]  [<c073bb5d>] ? dev_watchdog+0x0/0x12e
Mar 28 19:31:23 hsem kernel: [ 3977.281716]  [<c0445172>] __do_softirq+0xa9/0x14a
Mar 28 19:31:23 hsem kernel: [ 3977.281718]  [<c0445246>] do_softirq+0x33/0x3d
Mar 28 19:31:23 hsem kernel: [ 3977.281721]  [<c044544f>] irq_exit+0x31/0x64
Mar 28 19:31:23 hsem kernel: [ 3977.281724]  [<c040a2f4>] do_IRQ+0x7d/0x91
Mar 28 19:31:23 hsem kernel: [ 3977.281726]  [<c0408f30>] common_interrupt+0x30/0x38
Mar 28 19:31:23 hsem kernel: [ 3977.281729]  [<c04400e0>] ? kmsg_dump_unregister+0x35/0x48
Mar 28 19:31:23 hsem kernel: [ 3977.281733]  [<c05f9b64>] ? intel_idle+0xf2/0x119
Mar 28 19:31:23 hsem kernel: [ 3977.281737]  [<c070540c>] cpuidle_idle_call+0x6e/0xc1
Mar 28 19:31:23 hsem kernel: [ 3977.281739]  [<c040778c>] cpu_idle+0x8e/0xaf
Mar 28 19:31:23 hsem kernel: [ 3977.281742]  [<c07a5219>] rest_init+0x71/0x73
Mar 28 19:31:23 hsem kernel: [ 3977.281746]  [<c0a3b7e8>] start_kernel+0x34a/0x34f
Mar 28 19:31:23 hsem kernel: [ 3977.281749]  [<c0a3b0da>] i386_start_kernel+0xc9/0xd0
Mar 28 19:31:23 hsem kernel: [ 3977.281751] ---[ end trace 1e532a3917036b97 ]---
Mar 28 19:31:23 hsem kernel: [ 3977.281753] sky2 0000:06:00.0: eth0: tx timeout
--- snip ---
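
For comparing traces like the ones in this report, here is a minimal sketch, assuming Python is at hand, that pulls the interface, driver and queue number out of "NETDEV WATCHDOG" lines so it is easy to see which driver a given timeout belongs to (sky2 vs. bnx2 in this thread). The regex is only modelled on the log lines quoted above, and the names `WATCHDOG_RE` and `find_tx_timeouts` are hypothetical helpers, not part of any kernel tooling.

```python
import re

# Pattern modelled on the watchdog lines quoted in this report, e.g.
# "NETDEV WATCHDOG: eth0 (sky2): transmit queue 0 timed out".
# Assumption: other kernel versions may word this line differently.
WATCHDOG_RE = re.compile(
    r"NETDEV WATCHDOG: (?P<iface>\S+) \((?P<driver>[^)]+)\): "
    r"transmit queue (?P<queue>\d+) timed out"
)


def find_tx_timeouts(lines):
    """Return an (iface, driver, queue) tuple for each watchdog timeout line."""
    hits = []
    for line in lines:
        match = WATCHDOG_RE.search(line)
        if match:
            hits.append(
                (match.group("iface"), match.group("driver"), int(match.group("queue")))
            )
    return hits


# Sample lines copied from the traces in this report.
sample = [
    "Mar 28 19:31:23 hsem kernel: [ 3977.281607] NETDEV WATCHDOG: eth0 (sky2): transmit queue 0 timed out",
    "NETDEV WATCHDOG: eth0 (bnx2): transmit queue 0 timed out",
    "bnx2: eth0 NIC Copper Link is Down",
]

for iface, driver, queue in find_tx_timeouts(sample):
    print(f"{iface}: driver={driver} queue={queue}")
```

To run it against a real log, pass an open file instead of `sample`, for example `find_tx_timeouts(open("/var/log/messages"))`.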
Comment 5 Stephen Hemminger 2012-08-27 17:03:46 UTC
Please report RHEL bugs to Red Hat, not the kernel bugzilla.
Need reproduction with current kernel (3.4 or later).
