Bug 196399
Summary: | WARNING at net/sched/sch_generic.c:316 dev_watchdog[...] when suspending | ||
---|---|---|---|
Product: | Networking | Reporter: | Martin Peres (martin.peres) |
Component: | Other | Assignee: | Stephen Hemminger (stephen) |
Status: | CLOSED CODE_FIX | ||
Severity: | low | CC: | arkadiusz.hiler, daniel, lakshminarayana.vudum, svimik |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 4.11+ | Subsystem: | |
Regression: | Yes | Bisected commit-id: |
Description
Martin Peres
2017-07-17 08:34:16 UTC
It was fixed. Let's close this bug. It was never fixed, I apparently just had forgotten about our CI system carrying this extra patch: https://cgit.freedesktop.org/drm/drm-intel/commit/?h=topic/core-for-CI&id=d576dd33ecc7238e4dee92baca89d9c5e3543a44 Sorry for the noise! That patch is just a "shoot the messenger" The root cause is a driver bug for what every network device you are using. That information is not present in this bug report. (In reply to Stephen Hemminger from comment #3) > That patch is just a "shoot the messenger" It absolutely is. I wrote this bug report immediately, but development team could not wait for the fixes to come upstream, so a developer just proposed to land this patch while we were waiting for comments. Then we got bigger fishes to fry and the bug has just rotten to the point where I even thought it was not a problem anymore... Shame on us! > > The root cause is a driver bug for what every network device you are using. > That information is not present in this bug report. Fair enough, and the links I provided are all dead because we ran out of space on the server providing test results... Who would have thought that we would generate TB of datas per month... Anyway, here is the information about a couple of hosts that exhibit the issue: - https://intel-gfx-ci.01.org/hardware/fi-snb-2600/ - https://intel-gfx-ci.01.org/hardware/fi-skl-6600u/ - https://intel-gfx-ci.01.org/hardware/fi-kbl-7560u/ I will try to verify tomorrow that it is still a problem on newer kernels than 4.11-rc5 (4.15). Thanks! Sorry for the slight delay. I managed to get some more recent results on a 4.15.0-rc1 base, and got 2 out of our 30 machines to reproduce the problem: - Skylake: https://intel-gfx-ci.01.org/tree/drm-tip/Trybot_1460/fi-skl-6600u/igt@gem_exec_suspend@basic-s4-devices.html - Kabylake: https://intel-gfx-ci.01.org/tree/drm-tip/Trybot_1460/fi-kbl-r/igt@gem_exec_suspend@basic-s4-devices.html Check out the boot0 link to get the boot kernel logs, and dmesg0 for the kernel logs that happened during the execution of our testsuite. Sorry about the i915 spamming in the logs, it hopefully is not too annoying for you. *** Bug 198027 has been marked as a duplicate of this bug. *** Just wanted to throw this out from our pile of hacks, but we're still hitting this: [ 224.739686] ------------[ cut here ]------------ [ 224.739712] WARNING: CPU: 3 PID: 2982 at net/sched/sch_generic.c:461 dev_watchdog+0x1fd/0x210 [ 224.739714] Modules linked in: vgem snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hwdep snd_hda_core x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_pcm i915 asix usbnet mii mei_me mei prime_numbers i2c_hid pinctrl_sunrisepoint pinctrl_intel btusb btrtl btbcm btintel bluetooth ecdh_generic [ 224.739775] CPU: 3 PID: 2982 Comm: gem_exec_suspen Tainted: G U W 4.18.0-rc2-CI-Patchwork_9414+ #1 [ 224.739777] Hardware name: Dell Inc. XPS 13 9350/, BIOS 1.4.12 11/30/2016 [ 224.739780] RIP: 0010:dev_watchdog+0x1fd/0x210 [ 224.739781] Code: 49 63 4c 24 f0 eb 92 4c 89 ef c6 05 21 46 ad 00 01 e8 77 ee fc ff 89 d9 48 89 c2 4c 89 ee 48 c7 c7 88 4c 14 82 e8 a3 fe 84 ff <0f> 0b eb be 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 48 c7 47 [ 224.739866] RSP: 0018:ffff88027dd83e40 EFLAGS: 00010286 [ 224.739869] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000102 [ 224.739871] RDX: 0000000080000102 RSI: ffffffff820c8c6c RDI: 00000000ffffffff [ 224.739873] RBP: ffff8802644c1540 R08: 0000000071be9b33 R09: 0000000000000000 [ 224.739874] R10: ffff88027dd83dc0 R11: 0000000000000000 R12: ffff8802644c1588 [ 224.739876] R13: ffff8802644c1160 R14: 0000000000000001 R15: ffff88026a5dc728 [ 224.739878] FS: 00007f18f4887980(0000) GS:ffff88027dd80000(0000) knlGS:0000000000000000 [ 224.739880] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 224.739881] CR2: 00007f4c627ae548 CR3: 000000022ca1a002 CR4: 00000000003606e0 [ 224.739883] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 224.739885] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 224.739886] Call Trace: [ 224.739888] <IRQ> [ 224.739892] ? qdisc_reset+0xe0/0xe0 [ 224.739894] ? qdisc_reset+0xe0/0xe0 [ 224.739897] call_timer_fn+0x93/0x360 [ 224.739903] expire_timers+0xc1/0x1d0 [ 224.739908] run_timer_softirq+0xc7/0x170 [ 224.739916] __do_softirq+0xd9/0x505 [ 224.739923] irq_exit+0xa9/0xc0 [ 224.739926] smp_apic_timer_interrupt+0x9c/0x2d0 [ 224.739929] apic_timer_interrupt+0xf/0x20 [ 224.739931] </IRQ> [ 224.739934] RIP: 0010:delay_tsc+0x2e/0xb0 [ 224.739936] Code: 49 89 fc 55 53 bf 01 00 00 00 e8 6d 2c 78 ff e8 88 9d b6 ff 41 89 c5 0f ae e8 0f 31 48 c1 e2 20 48 09 c2 48 89 d5 eb 16 f3 90 <bf> 01 00 00 00 e8 48 2c 78 ff e8 63 9d b6 ff 44 39 e8 75 36 0f ae [ 224.740021] RSP: 0018:ffffc900002f7d48 EFLAGS: 00000286 ORIG_RAX: ffffffffffffff13 [ 224.740024] RAX: 0000000080000000 RBX: 0000000649565ca9 RCX: 0000000000000001 [ 224.740026] RDX: 0000000080000001 RSI: ffffffff820c8c6c RDI: 00000000ffffffff [ 224.740027] RBP: 00000006493ea9ce R08: 000000005e81e2ee R09: 0000000000000000 [ 224.740029] R10: 0000000000000120 R11: 0000000000000000 R12: 00000000002ad8d6 [ 224.740030] R13: 0000000000000003 R14: 0000000000000004 R15: ffff88025caf5408 [ 224.740040] ? delay_tsc+0x66/0xb0 [ 224.740045] hibernation_debug_sleep+0x1c/0x30 [ 224.740048] hibernation_snapshot+0x2c1/0x690 [ 224.740053] hibernate+0x142/0x2a4 [ 224.740057] state_store+0xd0/0xe0 [ 224.740063] kernfs_fop_write+0x104/0x190 [ 224.740068] __vfs_write+0x31/0x180 [ 224.740072] ? rcu_read_lock_sched_held+0x6f/0x80 [ 224.740075] ? rcu_sync_lockdep_assert+0x29/0x50 [ 224.740078] ? __sb_start_write+0x152/0x1f0 [ 224.740080] ? __sb_start_write+0x168/0x1f0 [ 224.740084] vfs_write+0xbd/0x1a0 [ 224.740088] ksys_write+0x50/0xc0 [ 224.740094] do_syscall_64+0x55/0x190 [ 224.740097] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 224.740099] RIP: 0033:0x7f18f400a281 [ 224.740100] Code: c3 0f 1f 84 00 00 00 00 00 48 8b 05 59 8d 20 00 c3 0f 1f 84 00 00 00 00 00 8b 05 8a d1 20 00 85 c0 75 16 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 57 f3 c3 0f 1f 44 00 00 41 54 55 49 89 d4 53 [ 224.740186] RSP: 002b:00007fffd1f4fec8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 [ 224.740189] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f18f400a281 [ 224.740190] RDX: 0000000000000004 RSI: 00007f18f448069a RDI: 0000000000000006 [ 224.740192] RBP: 00007fffd1f4fef0 R08: 0000000000000000 R09: 0000000000000000 [ 224.740194] R10: 0000000000000000 R11: 0000000000000246 R12: 000055e795d03400 [ 224.740195] R13: 00007fffd1f50500 R14: 0000000000000000 R15: 0000000000000000 [ 224.740205] irq event stamp: 1582591 [ 224.740207] hardirqs last enabled at (1582590): [<ffffffff810f9f9c>] vprintk_emit+0x4bc/0x4d0 [ 224.740210] hardirqs last disabled at (1582591): [<ffffffff81a0111c>] error_entry+0x7c/0x100 [ 224.740212] softirqs last enabled at (1582568): [<ffffffff81c0034f>] __do_softirq+0x34f/0x505 [ 224.740215] softirqs last disabled at (1582571): [<ffffffff8108c959>] irq_exit+0xa9/0xc0 [ 224.740218] WARNING: CPU: 3 PID: 2982 at net/sched/sch_generic.c:461 dev_watchdog+0x1fd/0x210 [ 224.740219] ---[ end trace 6e41d690e611c338 ]--- Stephen, Any updates on this bug. It is most likely a network device suspend/resume bug. Figure out what device it is, look in kernel MAINTAINERS file and email them (and netdev@vger.kernel.org). Kernel Bugzilla is not really used by network developers. Email sent to Oliver Neukum, netdev@vger.kernel.org. Device details: 0b95:7720 ASIX Electronics Corp. AX88772 USB network card driver seems to be usbnet Last seen this issue on our CI systems CI_DRM_4376 (7 months, 3 weeks / 4365 runs ago). Since this issue was seen 7 months ago, can we close this issue? (In reply to Stephen Hemminger from comment #10) > It is most likely a network device suspend/resume > bug. Figure out what device it is, look in kernel MAINTAINERS file and > email them (and netdev@vger.kernel.org). Kernel Bugzilla is not really > used by network developers. Thanks, this is exactly what we should have been doing a while ago... So, here is the plan of action for the developer that will take care of this issue: - Send a revert of https://cgit.freedesktop.org/drm/drm-intel/commit/?h=topic/core-for-CI&id=d576dd33ecc7238e4dee92baca89d9c5e3543a44 - Follow Stephen's recommendation - Make sure a patch lands - Done! I have done a two round of revert and it seems fine: https://lists.freedesktop.org/archives/intel-gfx-trybot/2019-April/thread.html#70131 Now we need someone that can push to drm-tip's topic/core-for-CI to remove the hax. Daniel? (In reply to Arek Hiler from comment #14) > I have done a two round of revert and it seems fine: > https://lists.freedesktop.org/archives/intel-gfx-trybot/2019-April/thread. > html#70131 > > Now we need someone that can push to drm-tip's topic/core-for-CI to remove > the hax. > > Daniel? The fun is still going on fi-icl-dsi: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6118/fi-icl-dsi/dmesg0.log <4>[ 108.016259] ------------[ cut here ]------------ <6>[ 108.016363] NETDEV WATCHDOG: enx000acd289764 (ax88179_178a): transmit queue 0 timed out <4>[ 108.016397] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:447 dev_watchdog+0x2b3/0x2f0 <4>[ 108.016401] Modules linked in: vgem snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic mei_hdcp i915 x86_pkg_temp_thermal coretemp crct10dif_pclmul snd_hda_intel crc32_pclmul snd_hda_codec snd_hwdep ghash_clmulni_intel e1000e snd_hda_core snd_pcm ptp ax88179_178a pps_core usbnet mii mei_me mei prime_numbers <4>[ 108.016444] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G U 5.2.0-rc1-CI-CI_DRM_6118+ #1 <4>[ 108.016449] Hardware name: Intel Corporation Ice Lake Client Platform/IceLake Y LPDDR4x T4 RVP TLC, BIOS ICLSFWR1.R00.3102.A00.1903052247 03/05/2019 <4>[ 108.016455] RIP: 0010:dev_watchdog+0x2b3/0x2f0 <4>[ 108.016461] Code: 77 ff e9 42 ff ff ff 4c 89 f7 c6 05 ca 3d a8 00 01 e8 e1 57 fc ff 89 d9 48 89 c2 4c 89 f6 48 c7 c7 c8 8a 14 82 e8 3d cf 81 ff <0f> 0b e9 25 ff ff ff e8 21 45 8a ff 85 c0 75 b6 48 c7 c2 28 5d 07 <4>[ 108.016466] RSP: 0018:ffffc90000003e30 EFLAGS: 00010282 <4>[ 108.016472] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 <4>[ 108.016476] RDX: 0000000000000000 RSI: ffffffff822180a0 RDI: 00000000ffffffff <4>[ 108.016481] RBP: ffff88828f9237e0 R08: 0000000078796510 R09: 0000000000000000 <4>[ 108.016485] R10: ffffc90000003db0 R11: 0000000000000000 R12: ffff88828f923830 <4>[ 108.016489] R13: 0000000000000000 R14: ffff88828f923400 R15: ffff88828f832b98 <4>[ 108.016494] FS: 0000000000000000(0000) GS:ffff88829be00000(0000) knlGS:0000000000000000 <4>[ 108.016499] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 <4>[ 108.016503] CR2: 0000559a66cd0af8 CR3: 0000000005210003 CR4: 0000000000760ef0 <4>[ 108.016508] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 <4>[ 108.016512] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 <4>[ 108.016516] PKRU: 55555554 <4>[ 108.016520] Call Trace: <4>[ 108.016525] <IRQ> <4>[ 108.016533] ? qdisc_destroy+0x110/0x110 <4>[ 108.016541] call_timer_fn+0x98/0x2f0 <4>[ 108.016548] ? qdisc_destroy+0x110/0x110 <4>[ 108.016554] expire_timers+0xdc/0x1b0 <4>[ 108.016563] run_timer_softirq+0xc7/0x170 <4>[ 108.016576] __do_softirq+0xd8/0x4b9 <4>[ 108.016596] irq_exit+0xa9/0xc0 <4>[ 108.016603] smp_apic_timer_interrupt+0x9c/0x250 <4>[ 108.016611] apic_timer_interrupt+0xf/0x20 <4>[ 108.016616] </IRQ> <4>[ 108.016623] RIP: 0010:cpuidle_enter_state+0xae/0x450 <4>[ 108.016629] Code: 44 00 00 31 ff e8 d2 62 92 ff 45 84 f6 74 12 9c 58 f6 c4 02 0f 85 78 03 00 00 31 ff e8 3b 45 99 ff e8 66 46 9d ff fb 45 85 ed <0f> 88 c9 02 00 00 4c 2b 24 24 48 ba cf f7 53 e3 a5 9b c4 20 49 63 <4>[ 108.016633] RSP: 0018:ffffffff82203e58 EFLAGS: 00000206 ORIG_RAX: ffffffffffffff13 <4>[ 108.016639] RAX: ffffffff82217800 RBX: ffffffff822a1e40 RCX: 0000000000000000 <4>[ 108.016643] RDX: 0000000000000046 RSI: 0000000000000006 RDI: ffffffff8212d1b9 <4>[ 108.016647] RBP: ffff8882977adfa8 R08: 0000000000000000 R09: 0000000000000000 <4>[ 108.016650] R10: 0000000000000000 R11: 0000000000000000 R12: 0000001926403514 <4>[ 108.016654] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000003 <4>[ 108.016681] ? cpuidle_enter_state+0xaa/0x450 <4>[ 108.016694] cpuidle_enter+0x24/0x40 <4>[ 108.016704] do_idle+0x1f3/0x260 <4>[ 108.016718] cpu_startup_entry+0x14/0x20 <4>[ 108.016727] start_kernel+0x4b6/0x4d8 <4>[ 108.016738] secondary_startup_64+0xa4/0xb0 <4>[ 108.016755] irq event stamp: 1110527 <4>[ 108.016763] hardirqs last enabled at (1110526): [<ffffffff81128635>] vprintk_emit+0x315/0x340 <4>[ 108.016769] hardirqs last disabled at (1110527): [<ffffffff810019e0>] trace_hardirqs_off_thunk+0x1a/0x1c <4>[ 108.016776] softirqs last enabled at (1110502): [<ffffffff810b6268>] irq_enter+0x58/0x60 <4>[ 108.016782] softirqs last disabled at (1110503): [<ffffffff810b6319>] irq_exit+0xa9/0xc0 <4>[ 108.016788] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:447 dev_watchdog+0x2b3/0x2f0 <4>[ 108.016792] ---[ end trace b815283152d81882 ]--- For context, the above backtrace happened immediately after removing the following patch: https://cgit.freedesktop.org/drm/drm-intel/commit/?h=topic/core-for-CI&id=d576dd33ecc7238e4dee92baca89d9c5e3543a44 Last seen this issue CI_DRM_6124 (11 months, 1 week old) ago. Shall we close this issue? The CI Bug Log issue associated to this bug has been archived. New failures matching the above filters will not be associated to this bug anymore. |