Bug 201849
Summary: | hw csum failure - reproducible error | ||
---|---|---|---|
Product: | Networking | Reporter: | jkyriannis |
Component: | IPV4 | Assignee: | Stephen Hemminger (stephen) |
Status: | NEW --- | ||
Severity: | normal | CC: | eugene, lance, saeedm |
Priority: | P1 | ||
Hardware: | Intel | ||
OS: | Linux | ||
Kernel Version: | 4.19.5-1.el7.elrepo.x86_64 | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Description
jkyriannis
2018-12-03 04:23:36 UTC
I've run into this problem as well and have pinned the problem between 4.14.70 and 4.14.71. I'm currently trying to bisect the problem and will report back once I think I found the commit which added the issue.

It seems that I've finally been able to bisect this to commit 88078d98d1bb085d72af8437707279e203524fa5 [1]. This seems to have caused a similar issue in the mlx5 driver that was fixed in d48051c5b8376038c2b287c3b1bd55b8d391d567 [2] and also in the sungem driver in 12b03558cef6d655d0d394f5e98a6fd07c1f6c0f [3]. I suspect that the mlx4 driver needs fixes similar to the ones mlx5 had to resolve the issue. If you grep through the git log for 88078d98d1bb, you'll see several other references to this issue. I confirmed this issue is still present in the latest 4.20.0-rc7.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=88078d98d1bb085d72af8437707279e203524fa5
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d48051c5b8376038c2b287c3b1bd55b8d391d567
[3] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=12b03558cef6d655d0d394f5e98a6fd07c1f6c0f

I got a reply directly from Eric and Dimitris from Google who introduced the patch. See below:

On Wed, Dec 19, 2018 at 4:21 PM Eric Dumazet <edumazet@google.com> wrote:
> On Wed, Dec 19, 2018 at 4:08 PM Lance Albertson <lance@osuosl.org> wrote:
> > On Wed, Dec 19, 2018 at 3:41 PM Dimitris Michailidis <dmichail@google.com> wrote:
> > > MLNX devices have an issue with packets that are padded past the end of
> > > the L3 payload with bytes that aren't all 0s. They use a mode of checksum
> > > reporting which should be including the padding bytes but MLNX devices
> > > leave those out. When the padding bytes aren't all 0 this omission causes
> > > a checksum error. This device behavior has existed for a long time but it
> > > has begun causing errors only this year. Before a padded packet had its HW
> > > checksum ignored so it wasn't material what HW had reported. More recently
> > > padded packet checksums started using the HW value and now it is
> > > noticeable when that value isn't right.
> > >
> > > I believe this issue is still being worked on upstream by Mellanox and
> > > other people.
> >
> > Thanks for the quick reply. Is this issue going to be fixed via a firmware
> > update on MLNX devices and/or via an update to the kernel driver? Is there
> > any public discussion about this on a mailing list where I can track the
> > progress?
>
> We do not know yet Mellanox definitive answer, but last time at LPC I got word
> from Mellanox engineer saying the fix would be in firmware.

Hi Lance, the mlx4 fix was submitted lately to v5.0-rc7 and was backported to -stable kernel 4.19.26, so no wonder you see the issues on your kernel.

4190a7bcd2af net/mlx4_en: Force CHECKSUM_NONE for short ethernet frames
https://patchwork.ozlabs.org/patch/1039923/

Can you please try this patch?

Oh excellent! I'll give this a try in the next few days and get back to you.

Sorry for the delay. I confirmed that the latest 4.19 kernel resolved this issue. Please go ahead and close this issue!
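
For anyone landing on this bug later, the checksum mismatch Dimitris describes can be sketched in a few lines of Python. This is only a simplified model under stated assumptions, not the kernel or mlx4 code: the function names are hypothetical, the payload is kept at an even length so the padding stays 16-bit aligned, the step where the stack subtracts the trimmed padding from the reported sum is my reading of how the hardware value is used after commit 88078d98d1bb, and the `ETH_ZLEN` threshold in the last helper is an assumption about what "short ethernet frames" means in the fix's title.

```python
ETH_ZLEN = 60  # minimum Ethernet frame length in bytes, excluding the FCS


def ones_complement_sum(data: bytes) -> int:
    """16-bit ones' complement sum (RFC 1071 style), not inverted."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with one zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return total


def ones_complement_sub(a: int, b: int) -> int:
    """Remove b's contribution from a in ones' complement arithmetic."""
    total = a + (~b & 0xFFFF)
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total


def same_sum(a: int, b: int) -> bool:
    """Compare two sums, treating 0x0000 and 0xFFFF as the same value."""
    return a % 0xFFFF == b % 0xFFFF


def hw_csum_ok(payload: bytes, padding: bytes, device_includes_padding: bool) -> bool:
    """Hypothetical model: does the stack accept the device-reported sum?

    The device reports a sum over the bytes it covered. The stack (assumed
    here) trims the padding and subtracts the padding's sum from the reported
    value, then expects the result to match the sum over the payload alone.
    """
    covered = payload + padding if device_includes_padding else payload
    reported = ones_complement_sum(covered)
    derived = ones_complement_sub(reported, ones_complement_sum(padding))
    return same_sum(derived, ones_complement_sum(payload))


def driver_should_skip_hw_csum(frame_len: int) -> bool:
    """Hypothetical model of the fix: do not trust HW checksums on frames
    short enough that they may have been padded (threshold is an assumption)."""
    return frame_len <= ETH_ZLEN


if __name__ == "__main__":
    payload = bytes(range(1, 41))             # 40 bytes, even length
    zero_pad = bytes(6)                       # all-zero padding
    junk_pad = b"\xde\xad\xbe\xef\x00\x01"    # non-zero padding
    print(hw_csum_ok(payload, junk_pad, device_includes_padding=True))
    print(hw_csum_ok(payload, zero_pad, device_includes_padding=False))
    print(hw_csum_ok(payload, junk_pad, device_includes_padding=False))
```

In this model, a device that covers the padding always passes, and a device that leaves the padding out passes only when the padding is all zeros, which matches the behavior described in the quoted reply. The last helper mirrors the idea in the title of 4190a7bcd2af: for short frames the driver reports no hardware checksum, so the stack verifies them in software instead.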