Bug 212573
| Summary: | netconsole triggers warning in netpoll_poll_dev | | |
|---|---|---|---|
| Product: | Networking | Reporter: | Oleksandr Natalenko (oleksandr) |
| Component: | Other | Assignee: | Stephen Hemminger (stephen) |
| Status: | RESOLVED CODE_FIX | | |
| Severity: | normal | CC: | bugs-a21 |
| Priority: | P1 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| URL: | https://git.kernel.org/netdev/net/c/eaeace60778e | | |
| Kernel Version: | 5.12.2 | Subsystem: | |
| Regression: | Yes | Bisected commit-id: | 16eb8815c235 |
| Attachments: | Fix from Jesse Brandeburg of Intel | | |
Description
Oleksandr Natalenko
2021-04-06 08:17:46 UTC
[22038.710801] igb_poll+0x0/0x1440 [igb] exceeded budget in poll
[22038.710802] WARNING: CPU: 12 PID: 40362 at net/core/netpoll.c:155

WARN_ONCE(work, "%pS exceeded budget in poll\n", napi->poll);

Sounds like a driver issue. You may want to bring this up on the intel-wired-lan list.

Thanks for the suggestion, posted an email here: [1]

[1] https://lore.kernel.org/lkml/20210406123619.rhvtr73xwwlbu2ll@spock.localdomain/

The logic in igb_poll() is going to do a receive of one packet even when called with a budget of zero. It is an off-by-one in the logic of igb_clean_rx_irq() here:

while (likely(total_packets < budget)) {

I have seen this regularly for a long time now. It is not specific to suspend/resume. The stack trace I see is largely the same as the OP's, but with the addition of bridge devices. On every boot, the system goes to perform a CIFS mount of \\Server\Share, and triggers this:

igb_poll+0x0/0x1290 exceeded budget in poll
__netpoll_send_skb+0x1d1/0x230
netpoll_send_skb+0x11/0x30
br_dev_xmit+0x248/0x3e0
netpoll_start_xmit+0x110/0x1b0
__netpoll_send_skb+0x14b/0x230
netpoll_send_udp+0x2b3/0x3f0
write_msg+0x121/0x140
console_unlock+0x34d/0x430
vprintk_emit+0x10e/0x1a0
printk+0x53/0x6a
cifs_smb3_do_mount.cold+0x2f/0x60 [cifs]
[etc]

And of course, the outgoing packet is never sent over the wire. Only one such WARN is printed per system uptime.

Kris, it seems the issue pops up for you more often than it does for me. Would it be possible for you to verify suggested patch [1]? Thanks.

[1] https://lore.kernel.org/lkml/20210406114734.0e00cb2f@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com/

Incidentally (and this comment is /off topic/ but included for completeness), these netconsole events have not been logged to the syslog server I have in a long time.
It seems that the sysklogd package (from Debian/Ubuntu/NetBSD/FreeBSD) has recently started adhering more closely to RFC3164 and RFC5424, one side effect being that improperly formatted syslog UDP packets on port 514 are discarded. And netconsole has been improperly formatting packets for quite some time, specifically in omitting the "<PRI>" string required at the start of each submission. Curiously, <PRI> is included and is parsed by "dmesg" to colorize output (etc), but gets stripped (or not generated?) when going via netconsole.

I will give the patch in comment #5 a whirl when I get a free moment here...

I just tested the patch in comment #5 and am running it as I pen this. Alas, it made no difference. The warning and stack trace appear identical.

P.S. A correction: I stated incorrectly in comment #4 that packets were not being pushed out by netconsole with the igb driver. They are indeed sent, and I am (with a patched syslogd to work around the issue in comment #6) receiving them on the syslog server.

Created attachment 296685
Fix from Jesse Brandeburg of Intel
Kris, would it be possible for you to test the patch I've just attached? Thanks.

I was hoping that the new patch might work, but, alas, no joy. Applying the fix from Jesse Brandeburg results in no visible changes, same stack trace as before. (This was tested against mainline 5.12.1)

OK, it does not work for me either. I've just given feedback on this patch to the developer. Thanks.

Oleksandr,

This bug is still in state NEW and relevant, as this WARNING is issued on every boot the first time that netconsole goes to transmit a message. I am concerned that the people submitting patches, e.g. those in comment 5 and comment 10, are being worked with off-line, and have no connection to bug 212573. These people (and any relevant mailing lists) should be added to the CC list of this bug; please do so.

This bug is a regression, so the metadata at the top should be updated to reflect this. The kernel version listed, 5.12.2, is incorrect; the bug was introduced at the start of the relevant series. (Was it introduced between 5.9 and 5.10? It's been so long now, I have forgotten.) Hopefully, after getting the metadata right, some interest will be generated in the appropriate developer community.

I'm not even sure kernel BZ is the right place to discuss issues like this. LKML serves the purpose better, and the discussion was revived recently: [1]. Feel free to chime in (I gave up on it because my emails on proposed patches were ignored).

[1] https://lore.kernel.org/lkml/DM6PR12MB45165BFF3AB84602238FA595D89B9@DM6PR12MB4516.namprd12.prod.outlook.com/

Possible fix: [1]

[1] https://lore.kernel.org/netdev/DM6PR12MB451635351CFBBD86059A0078D8619@DM6PR12MB4516.namprd12.prod.outlook.com/T/#mdf4cb6ce507162f97b59882ef5c9bfeb2b48d8d7

Marking this as closed and fixed.