Bug 211911

Summary: Panic in skb_find_text(), regression in 5.11 vs 5.10
Product: Networking Reporter: Kris Karas (bugs-a21)
Component: Netfilter/IptablesAssignee: networking_netfilter-iptables (networking_netfilter-iptables)
Severity: normal CC: willemb
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 5.11 Tree: Mainline
Regression: Yes
Attachments: Script to trigger the bug on vulnerable systems

Description Kris Karas 2021-02-24 06:54:15 UTC
Looks as if a regression was introduced in mainline 5.11 in skb_find_text() or its callers, which (I assume) is the routine that performs "-m string" matching in iptables.

I just upgraded my border router (has about 350 iptables rules) from 5.10.18 to 5.11.1, and it panicked a minute after boot.  Rebooted, it panicked about 3 minutes into that second boot.  In each case, the stack trace left in pstore was the same:
Call Trace:
  ? skb_find_text+0xd8/0xe0
  string_mt+0x22/0x2b [xt_string]
  ? ip_forward_options+0x1c0/0x1c0
  ? ip_frag_init+0x50/0x50
  ? _copy_from_user+0x36/0x70
  ? _copy_from_user+0x36/0x70
  ? iovec_from_user+0x58/0x160

None of my other 5.11.1 machines seemed affected.  Then it occurred to me, "find_text", right!  The border router has a rule to DROP outgoing "recursion refused" DNS packets (a common practice to mitigate DNS amplification attacks).  The rule itself is straightforward:
  iptables -A OUTPUT -p udp --sport 53 \
           -m string --algo bm --from 30 --to 32 \
           --hex-string "|8105|" -j DROP
This would make sense from a timing perspective, as my router is hit with DNS ./ANY/IN queries every minute or three.

Checking the changelog for 5.11, I see many changes to files with "skb" in the name.  Hmm.  I bisect might not be a bad idea.
Comment 1 Kris Karas 2021-02-24 23:53:21 UTC
Got it via bisect:

97550f6fa59254435d864b92603de3ca4b5a99f8 is the first bad commit
commit 97550f6fa59254435d864b92603de3ca4b5a99f8
Author: Willem de Bruijn <willemb@google.com>
Date:   Sat Jan 9 17:18:33 2021 -0500

    net: compound page support in skb_seq_read
    skb_seq_read iterates over an skb, returning pointer and length of
    the next data range with each call.
    It relies on kmap_atomic to access highmem pages when needed.
    An skb frag may be backed by a compound page, but kmap_atomic maps
    only a single page. There are not enough kmap slots to always map all
    pages concurrently.
    Instead, if kmap_atomic is needed, iterate over each page.
    As this increases the number of calls, avoid this unless needed.
    The necessary condition is captured in skb_frag_must_loop.
    I tried to make the change as obvious as possible. It should be easy
    to verify that nothing changes if skb_frag_must_loop returns false.
      On an x86 platform with
        ip link set dev lo mtu 1500
        iptables -A OUTPUT -m string --string 'badstring' -algo bm -j ACCEPT
        dd if=/dev/urandom of=in bs=1M count=20
        nc -l -p 8000 > /dev/null &
        nc -w 1 -q 0 localhost 8000 < in
    Signed-off-by: Willem de Bruijn <willemb@google.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

 include/linux/skbuff.h |  1 +
 net/core/skbuff.c      | 28 +++++++++++++++++++++++-----
 2 files changed, 24 insertions(+), 5 deletions(-)
Comment 2 Kris Karas 2021-02-25 00:06:08 UTC
Added Willem to the CC list, as he is the patch author.

Additional notes:
  This machine that is effected is x86_64, so CONFIG_HIGHMEM does not apply.
  Similarly, CONFIG_DEBUG_KMAP=n
Comment 3 Willem de Bruijn 2021-02-25 03:56:59 UTC
Thanks for the report. Apologies for the issue.

To be clear, your config does not have CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP=y, right?

I structured the patch to be as obvious as possible a NOOP when skb_frag_must_loop returns false.

Your stack shows locally generated UDP replies. These are normal packets, smaller than MTU, no UDP_GSO?

I would expect those to be wholly linear, no frags whatsoever.

On top of that, skb_find_text even bails after offset 32.
Comment 4 Kris Karas 2021-02-25 16:41:22 UTC
Hi Willem,

Indeed, all of the KMAP debug options are set to false.  As this is a border router, kernel config options are set rather conservatively.  It has a Ryzen 5 CPU; and the ethernet interface in question reports itself as a RTL 8168h/8111h (r8169 kernel driver).

I'm not positive about the GSO; I looked for any references to UDP_SEGMENT within the ISC BIND source, and found none, so blindly assume "no" there.  CONFIG_NET_MPLS is unset, as is the _GSO.

The packets themselves are small.  The incoming queries are 75 bytes in length, the outgoing response 70 bytes (including the ethernet frame).  Well below MTU (which is at a default 1500).  No fragmentation was reported by wireshark in any of the packet captures.

As for skb_find_text bailing at offset 32 (I assume the IP, not the frame), that's right at the end of the search target, which is offsets 30 through 31.

Thanks for helping with this!

Comment 5 Willem de Bruijn 2021-02-25 16:48:06 UTC
Thanks for the extra details. Let me try to reproduce it locally.

The __stack_chk_fail means a stack overflow was discovered.

The odd thing is that the patch really only modifies code that touches skb frags. Which these packets should not have.
Comment 6 Kris Karas 2021-02-25 19:03:53 UTC
I agree that this sounds odd, given that no frags are involved.   I should have been more rigorous, and recompiled 5.11.1 with just this one commit backed out, to prove that the "git bisect" didn't pick the wrong commit.  I'll try that later today after my 8yo is done with her Zoom school; don't want to crash her video feed.  :-)
Comment 7 Kris Karas 2021-02-26 05:41:28 UTC
Just to follow up on comment 6, my router is now running vanilla 5.11.1 with 97550f6fa592 backed out, has seen the iptables string match rule hit 161 times as I pen this, and no OOPSen or other dmesg output has occurred.  So this is definitely the errant patch, even though it doesn't sound likely given the gating factors mentioned in comment 3.
Comment 8 Willem de Bruijn 2021-02-26 14:24:40 UTC
That's pretty clear. I hope no one's zoom school was impacted :)

I haven't gotten to writing a reproducer yet. Will do over the weekend and get back.
Comment 9 Kris Karas 2021-02-27 02:42:59 UTC
Created attachment 295499 [details]
Script to trigger the bug on vulnerable systems
Comment 10 Kris Karas 2021-02-27 02:46:47 UTC

I've written a test script for you that demonstrates the bug, uploaded above.  It uses "nping" (usually bundled with the "nmap" package) to generate raw test packets, and also iptables (or ip6tables) to dynamically create a rule to catch the nping output.  Script is a bit verbose with sanity-checking, but should clean up after itself.

Comment 11 Willem de Bruijn 2021-02-27 02:47:57 UTC
Thanks a lot, Kris! Looking into it right now. The nping reproducer will help a lot.
Comment 12 Willem de Bruijn 2021-02-27 05:48:26 UTC
Commit 97550f6fa592 ("net: compound page support in skb_seq_read") adds 4B field frag_off to skb_seq_state, growing it from 40 to 48B on 64-bit platforms.

This state is stored in the control block, as shown by TS_SKB_CB().

I assumed that this was like other .._SKB_CB macros and points to skb->cb[], which is 48B.

But in skb_find_text it points into stack allocated ts_state:

    struct ts_state
        unsigned int        offset;
        char                cb[40];

where cb is 40 bytes, so the write to frag_off corrupts the stack.

At first attempt this fixes it:

    struct ts_state
        unsigned int        offset;
-        char                cb[40];
+        char                cb[40];

Will take a closer look, and write it in such a way that this does not happen again next time skb_seq_state grows.
Comment 13 Willem de Bruijn 2021-02-27 05:51:46 UTC
the from 40 to 48 is not a typo: 4B field + 4B padding:

$ pahole -C skb_seq_state net/core/skbuff.o
struct skb_seq_state {
        __u32                      lower_offset;         /*     0     4 */
        __u32                      upper_offset;         /*     4     4 */
        __u32                      frag_idx;             /*     8     4 */
        __u32                      stepped_offset;       /*    12     4 */
        struct sk_buff *           root_skb;             /*    16     8 */
        struct sk_buff *           cur_skb;              /*    24     8 */
        __u8 *                     frag_data;            /*    32     8 */
        __u32                      frag_off;             /*    40     4 */

        /* size: 48, cachelines: 1, members: 8 */
        /* padding: 4 */
        /* last cacheline: 48 bytes */
Comment 14 Kris Karas 2021-02-28 00:55:34 UTC
I'm being an idiot here, but, would something such as:
  struct ts_state
      unsigned int    offset;
      char            cb[sizeof(struct skb_seq_state)];
Comment 15 Willem de Bruijn 2021-02-28 03:05:46 UTC
Yes it would. Not an idiotic suggestion!

But I think the textsearch infra in include/linux/textsearch.h is intended not to depend on skbuffs. Even if in practice it has no other user.

We can avoid introducing the dependency by doing the reverse where both structs are known. In skb_find_text:

BUILD_BUG_ON(sizeof(struct skb_seq_state) > sizeof(state.cb));

I'd rather get rid of the weird untyped state.cb altogether. Taking a look at that. But likely comment 12 + this is the shortest patch and therefore most suitable solution for stable branches.
Comment 16 Kris Karas 2021-02-28 05:56:19 UTC
That's an honorable goal, to keep unexpected dependencies out of the kernel.  Kudos for that!  I can say with a bit of a laugh that, having run custom-configured kernels in production environments since late 1993, that the kernel is resplendent with unexpected dependencies.

Reading the code, there are zero comments that accompany skb_seq_state in include/linux/skbuff.h, which is to say, not one reference to downstream users requiring synchronous code updates.  I think each owner of downstream code that embeds anonymous and undocumented size_t's into their own code owes it to the community to add a comment before the upstream structure, something like this:
  /* When adding, removing, or re-arranging struct skb_seq_state,
   * please update "char cb[40];" within "struct ts_state" in file
   * include/linux/textsearch.h so that the "40" matches your new
   * sizeof(struct skb_seq_state).  -- John Q. Maintainer
That way, authors who need to update skb_seq_state will know what else they need to tweak when writing the patch.  As it is, it is maddening that there is no definitive way to know who else's code you need to edit.

Rant aside, I wonder if there are other undocumented users of sizeof(struct skb_seq_state) besides textsearch?  In particular, in 5.11, netconsole doesn't work on any one of my systems.  I see several that all report,

On the system I'm writing this on now, I see this in the logs:
  springdale kernel: igb_poll+0x0/0x1240 exceeded budget in poll
  springdale kernel: Call Trace:
  springdale kernel:  __netpoll_send_skb+0x1d1/0x230
  springdale kernel:  netpoll_send_skb+0x11/0x30
  springdale kernel:  br_dev_xmit+0x248/0x3e0
  springdale kernel:  netpoll_start_xmit+0x110/0x1b0
  springdale kernel:  __netpoll_send_skb+0x14b/0x230
  springdale kernel:  netpoll_send_udp+0x2b3/0x3f0
  springdale kernel:  write_msg+0x121/0x140

Might be just a coincidence, something else affecting socket buffers.
Comment 17 Willem de Bruijn 2021-02-28 14:03:51 UTC
That sounds like a separate unrelated issue.

The BUILD_BUG_ON is intended to be a more strict, because also machine verifiable, version of documentation. Coupled with git blame pointing to a git commit message that can be much more detailed than inline comments.

When I send a patch, I'll add you as Reported-by:  . Let me know if you would rather I do not.
Comment 18 Willem de Bruijn 2021-03-02 19:34:47 UTC
The commit has been merged in time for 5.11.3
Comment 19 Willem de Bruijn 2021-03-02 19:39:01 UTC
Where the commit is

author	Willem de Bruijn <willemb@google.com>	2021-03-01 15:09:44 +0000
committer	David S. Miller <davem@davemloft.net>	2021-03-01 15:25:24 -0800
commit	b228c9b058760500fda5edb3134527f629fc2dc3 (patch)
tree	da9c298ac2224e735f2d0f56c28a48e7324c8701
parent	2353db75c3db1dd26ff9c8feccfd3543a9cb73be (diff)

net: expand textsearch ts_state to fit skb_seq_state
The referenced commit expands the skb_seq_state used by
skb_find_text with a 4B frag_off field, growing it to 48B.

This exceeds container ts_state->cb, causing a stack corruption:

[   73.238353] Kernel panic - not syncing: stack-protector: Kernel stack
is corrupted in: skb_find_text+0xc5/0xd0
[   73.247384] CPU: 1 PID: 376 Comm: nping Not tainted 5.11.0+ #4
[   73.252613] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 1.14.0-2 04/01/2014
[   73.260078] Call Trace:
[   73.264677]  dump_stack+0x57/0x6a
[   73.267866]  panic+0xf6/0x2b7
[   73.270578]  ? skb_find_text+0xc5/0xd0
[   73.273964]  __stack_chk_fail+0x10/0x10
[   73.277491]  skb_find_text+0xc5/0xd0
[   73.280727]  string_mt+0x1f/0x30
[   73.283639]  ipt_do_table+0x214/0x410

The struct is passed between skb_find_text and its callbacks
skb_prepare_seq_read, skb_seq_read and skb_abort_seq read through
the textsearch interface using TS_SKB_CB.

I assumed that this mapped to skb->cb like other .._SKB_CB wrappers.
skb->cb is 48B. But it maps to ts_state->cb, which is only 40B.

skb->cb was increased from 40B to 48B after ts_state was introduced,
in commit 3e3850e989c5 ("[NETFILTER]: Fix xfrm lookup in

Increase ts_state.cb[] to 48 to fit the struct.

Also add a BUILD_BUG_ON to avoid a repeat.

The alternative is to directly add a dependency from textsearch onto
linux/skbuff.h, but I think the intent is textsearch to have no such
dependencies on its callers.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=211911
Fixes: 97550f6fa592 ("net: compound page support in skb_seq_read")
Reported-by: Kris Karas <bugs-a17@moonlit-rail.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Comment 20 Kris Karas 2021-03-03 06:43:17 UTC
Excellent.  Thank you, Willem!

I am running the diff against the vanilla 5.11.2, and just as I have started typing, the text match has hit 50 times with no complaints.

Shall we label this bug as Closed/Code-Fix ?
Comment 21 Willem de Bruijn 2021-03-03 13:17:20 UTC
Thanks for verifying Kris. Yes, sounds good.
Comment 22 Kris Karas 2021-03-07 19:10:10 UTC
It appears that this patch, while in netdev, has not made it to 5.11.
Both 5.11.3 and now 5.11.4 are missing it.
Can we push this to Greg KH's queue?
Comment 23 Willem de Bruijn 2021-03-08 15:57:00 UTC
Good catch.

The patch is in the stable queue, waiting to be merged


For details on stable backports for networking, see also


(I had to myself, to recall the URL of the stable queue)