218033 – kernel tried to execute NX-protected page - exploit attempt? (uid: 0)

Summary: kernel tried to execute NX-protected page - exploit attempt? (uid: 0)

Status:	RESOLVED PATCH_ALREADY_AVAILABLE

Alias:	None

Product:	Memory Management
Classification:	Unclassified
Component:	Page Allocator
Hardware:	Intel Linux

Importance:	P3 normal
Assignee:	Andrew Morton

URL:
Keywords:

Depends on:
Blocks:

Reported:	2023-10-21 17:31 UTC by CM76
Modified:	2023-10-25 11:37 UTC
CC List:	1 user

See Also:
Kernel Version:
Subsystem:
Regression:	No
Bisected commit-id:

Attachments
dmesg.202310211711 (95.21 KB, text/plain) 2023-10-21 17:34 UTC, CM76
dmesg.202310180543 (69.90 KB, text/plain) 2023-10-21 20:38 UTC, CM76
dmesg.202310221752 (95.53 KB, text/plain) 2023-10-22 16:38 UTC, CM76

Description CM76 2023-10-21 17:31:43 UTC

I believe this is also an issue with the Broadcom bnx2 driver, since it only seems to happen when I enable "tx-nocache-copy" in ethtool.

The issue started when I was running the mainline/stable kernel v6.5.x on another machine. After googling a bit, I landed on a Red Hat article that pointed to failing hardware as a possible cause. I was renting the server, so I didn't bother to file a bug report and assumed the server was going bad. But then it happened again on my other server as soon as I switched the bittorrent client to the same one I was using on that other server. I turned "tx-nocache-copy" off and ran mainline kernel v6.5 (on Ubuntu 23.04) for a day or two without issue. After that I switched the kernel back to Ubuntu's kernel (v6.2) and the server ran for a couple more days without issue. Two days ago I turned "tx-nocache-copy" on again out of curiosity (kernel v6.2), and the server didn't run into any issue with this setting on. This morning I upgraded to Ubuntu 23.10, which runs its own version of kernel v6.5. The kernel panicked and the server rebooted a couple of hours later.


The issue seems to be triggered by a certain configuration of applications. I've run the mainline/stable kernel 6.5.x since its release (and before that v6.4.x) with the rtorrent bittorrent client and "tx-nocache-copy" turned on, and the kernel didn't run into any issue for weeks until I switched to another bittorrent client (qbittorrent) last week. The load doesn't seem to matter: the kernel can Oops while the client is downloading a single small torrent as well as while it is downloading multiple torrents at the same time.


I tried to use the crash utility to get the backtrace, but it doesn't seem to work correctly. I get "crash: invalid structure member offset: module_core_size FILE: kernel.c  LINE: 3781  FUNCTION: module_init()" when I try to load the kernel dump.

The kernel panic happens with 6.5.x Mainline/stable kernel as well as the 6.5 kernel that comes with ubuntu 23.10.

The bittorrent clients run as systemd services with normal user privileges and "ProtectKernelModules=yes" "NoNewPrivileges=yes" set in the systemd service. 

I attached the full dmesg, and I can send the kernel dump file generated by kdump if needed.


------------------------
[12090.273551] kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
[12090.273577] BUG: unable to handle page fault for address: ffff9441c9734458
[12090.273590] #PF: supervisor instruction fetch in kernel mode
[12090.273602] #PF: error_code(0x0011) - permissions violation
[12090.273614] PGD 157401067 P4D 157401067 PUD 23ffff067 PMD 108a81063 PTE 8000000109734063
[12090.273632] Oops: 0011 [#1] PREEMPT SMP PTI
[12090.273643] CPU: 1 PID: 0 Comm: swapper/1 Kdump: loaded Not tainted 6.5.0-9-generic #9-Ubuntu
[12090.273658] Hardware name: Dell Inc. PowerEdge R210 II/03X6X0, BIOS 2.10.0 05/24/2018
[12090.273674] RIP: 0010:0xffff9441c9734458
[12090.273694] Code: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 <00> 00 00 00 00 00 00 00 58 44 73 c9 41 94 ff ff 00 00 00 00 00 00
[12090.273723] RSP: 0018:ffffb3c380138980 EFLAGS: 00010282
[12090.273734] RAX: ffff9441c9734458 RBX: ffff9441c9734400 RCX: 0000000000000000
[12090.273746] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9441c9734400
[12090.273758] RBP: ffffb3c380138990 R08: 0000000000000000 R09: 0000000000000000
[12090.273771] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9441c9734400
[12090.273783] R13: 00000000000005dc R14: ffff9441c49dda00 R15: ffffffff9e55ec40
[12090.273795] FS:  0000000000000000(0000) GS:ffff9442f7c40000(0000) knlGS:0000000000000000
[12090.273811] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[12090.273823] CR2: ffff9441c9734458 CR3: 0000000155a3a006 CR4: 00000000001706e0
[12090.273837] Call Trace:
[12090.273845]  <IRQ>
[12090.273851]  ? show_regs+0x6d/0x80
[12090.273864]  ? __die+0x24/0x80
[12090.273873]  ? page_fault_oops+0x99/0x1b0
[12090.273884]  ? kernelmode_fixup_or_oops+0xb2/0x140
[12090.273896]  ? __bad_area_nosemaphore+0x1a5/0x2c0
[12090.273908]  ? bad_area_nosemaphore+0x16/0x30
[12090.273918]  ? do_kern_addr_fault+0x7b/0xa0
[12090.273927]  ? exc_page_fault+0x1a4/0x1b0
[12090.273939]  ? asm_exc_page_fault+0x27/0x30
[12090.273952]  ? skb_release_head_state+0x27/0xb0
[12090.273964]  consume_skb+0x33/0xf0
[12090.273973]  tcp_mtu_probe+0x565/0x5d0
[12090.273984]  tcp_write_xmit+0x579/0xab0
[12090.273994]  __tcp_push_pending_frames+0x37/0x110
[12090.274005]  tcp_rcv_established+0x264/0x730
[12090.274015]  ? security_sock_rcv_skb+0x39/0x60
[12090.274027]  tcp_v4_do_rcv+0x169/0x2a0
[12090.274037]  tcp_v4_rcv+0xd92/0xe00
[12090.274046]  ? raw_v4_input+0xaa/0x240
[12090.274056]  ip_protocol_deliver_rcu+0x3c/0x210
[12090.274068]  ip_local_deliver_finish+0x77/0xa0
[12090.274078]  ip_local_deliver+0x6e/0x120
[12090.274089]  ? __pfx_ip_local_deliver_finish+0x10/0x10
[12090.274369]  ip_sublist_rcv_finish+0x6f/0x80
[12090.274638]  ip_sublist_rcv+0x171/0x220
[12090.274931]  ? __pfx_ip_rcv_finish+0x10/0x10
[12090.275201]  ip_list_rcv+0x102/0x140
[12090.275459]  __netif_receive_skb_list_core+0x22d/0x250
[12090.275714]  netif_receive_skb_list_internal+0x1a3/0x2d0
[12090.275967]  napi_complete_done+0x74/0x1c0
[12090.276218]  bnx2_poll_msix+0xa1/0xe0 [bnx2]
[12090.276468]  __napi_poll+0x33/0x1f0
[12090.276708]  net_rx_action+0x181/0x2e0
[12090.276943]  __do_softirq+0xd9/0x346
[12090.277172]  ? handle_irq_event+0x52/0x80
[12090.277393]  ? handle_edge_irq+0xda/0x250
[12090.277604]  __irq_exit_rcu+0x75/0xa0
[12090.277812]  irq_exit_rcu+0xe/0x20
[12090.278015]  common_interrupt+0xa4/0xb0
[12090.278217]  </IRQ>
[12090.278411]  <TASK>
[12090.278602]  asm_common_interrupt+0x27/0x40
[12090.278798] RIP: 0010:cpuidle_enter_state+0xda/0x730
[12090.278992] Code: 11 04 ff e8 a8 f5 ff ff 8b 53 04 49 89 c7 0f 1f 44 00 00 31 ff e8 26 bb 02 ff 80 7d d0 00 0f 85 61 02 00 00 fb 0f 1f 44 00 00 <45> 85 f6 0f 88 f7 01 00 00 4d 63 ee 49 83 fd 0a 0f 83 17 05 00 00
[12090.279402] RSP: 0018:ffffb3c3800cbe18 EFLAGS: 00000246
[12090.279612] RAX: 0000000000000000 RBX: ffff9442f7c7ec00 RCX: 0000000000000000
[12090.279827] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000000
[12090.280042] RBP: ffffb3c3800cbe68 R08: 0000000000000000 R09: 0000000000000000
[12090.280259] R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff9d0d24a0
[12090.280478] R13: 0000000000000003 R14: 0000000000000003 R15: 00000afefc75867b
[12090.280698]  ? cpuidle_enter_state+0xca/0x730
[12090.280918]  ? finish_task_switch.isra.0+0x89/0x2b0
[12090.281142]  cpuidle_enter+0x2e/0x50
[12090.281363]  call_cpuidle+0x23/0x60
[12090.281583]  cpuidle_idle_call+0x11d/0x190
[12090.281804]  do_idle+0x82/0xf0
[12090.282022]  cpu_startup_entry+0x1d/0x20
[12090.282240]  start_secondary+0x129/0x160
[12090.282460]  secondary_startup_64_no_verify+0x17e/0x18b
[12090.282685]  </TASK>
[12090.282902] Modules linked in: tcp_diag inet_diag ip6table_filter ip6_tables xt_LOG nf_log_syslog xt_recent xt_limit xt_tcpudp xt_conntrack iptable_filter xt_CT xt_set iptable_raw bpfilter ip_set_hash_ip ip_set_hash_net ip_set_hash_ipport ip_set_list_set ip_set_bitmap_port ip_set_hash_netiface ip_set nfnetlink binfmt_misc intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel aesni_intel crypto_simd cryptd rapl intel_cstate ipmi_ssif mgag200 drm_shmem_helper cfg80211 input_leds drm_kms_helper dcdbas at24 i2c_i801 lpc_ich i2c_smbus ie31200_edac acpi_ipmi i2c_algo_bit ipmi_si ipmi_devintf ipmi_msghandler sch_fq tcp_bbr nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c drm efi_pstore dmi_sysfs ip_tables x_tables autofs4 hid_generic usbhid hid crc32_pclmul ahci mpt3sas libahci raid_class bnx2 scsi_transport_sas wmi
[12090.285082] CR2: ffff9441c9734458
----
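For readers following the oops above, the "#PF: error_code(0x0011)" and "PTE 8000000109734063" values can be decoded by hand. The sketch below is only an illustration: the bit names follow the x86-64 architecture definitions, the helper names (`decode_bits`, `pte_physical_address`) are made up for this example, and it assumes a 4 KiB page mapping.

```python
# Page-fault error code bits (x86-64). Only the low five bits are listed here.
PF_ERROR_BITS = {
    0: "P (protection violation on a present page)",
    1: "W/R (write access)",
    2: "U/S (user-mode access)",
    3: "RSVD (reserved bit set in a paging entry)",
    4: "I/D (instruction fetch)",
}

# Flag bits of a 4 KiB page-table entry (x86-64).
PTE_FLAG_BITS = {
    0: "P", 1: "RW", 2: "US", 3: "PWT", 4: "PCD",
    5: "A", 6: "D", 7: "PAT", 8: "G", 63: "NX",
}


def decode_bits(value, table):
    """Return the names of all bits from `table` that are set in `value`."""
    return [name for bit, name in sorted(table.items()) if (value >> bit) & 1]


def pte_physical_address(pte):
    """Extract the page frame address (bits 12..51) from a 4 KiB PTE."""
    return pte & 0x000F_FFFF_FFFF_F000


error_code = 0x0011          # from "#PF: error_code(0x0011)"
pte = 0x8000000109734063     # from the "PTE" field of the page-table walk

print(decode_bits(error_code, PF_ERROR_BITS))
print(decode_bits(pte, PTE_FLAG_BITS))
print(hex(pte_physical_address(pte)))
```

The error code has the P and I/D bits set, which matches the "supervisor instruction fetch" and "permissions violation" lines. The PTE is present and writable with NX set, so the page is ordinary kernel data rather than code, and its frame address 0x109734000 is the same value that `kmem` reports in the PHYSICAL column in comment 2.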

Comment 1 CM76 2023-10-21 17:34:44 UTC

Created attachment 305274
dmesg.202310211711

Comment 2 CM76 2023-10-21 20:38:04 UTC

I managed to load the dump in crash on another machine. I also attached the dmesg from the crash that happened while I was running the mainline/stable v6.5 kernel. I attached the same two dmesg earlier by mistake.



-------------------
crash> set -p
    PID: 0
COMMAND: "swapper/1"
   TASK: ffff9441c0958000  (1 of 4)  [THREAD_INFO: ffff9441c0958000]
    CPU: 1
  STATE: TASK_RUNNING (PANIC)

crash> bt
PID: 0        TASK: ffff9441c0958000  CPU: 1    COMMAND: "swapper/1"
 #0 [ffffb3c380138610] machine_kexec at ffffffff9acafa3b
 #1 [ffffb3c380138670] __crash_kexec at ffffffff9ae133f3
 #2 [ffffb3c380138738] crash_kexec at ffffffff9ae14de2
 #3 [ffffb3c380138748] oops_end at ffffffff9ac52131
 #4 [ffffb3c380138770] page_fault_oops at ffffffff9acc77b0
 #5 [ffffb3c3801387d0] kernelmode_fixup_or_oops at ffffffff9acc7962
 #6 [ffffb3c380138810] __bad_area_nosemaphore at ffffffff9acc7ba5
 #7 [ffffb3c380138868] bad_area_nosemaphore at ffffffff9acc7ce6
 #8 [ffffb3c380138878] do_kern_addr_fault at ffffffff9acc7d8b
 #9 [ffffb3c3801388a0] exc_page_fault at ffffffff9bd41864
#10 [ffffb3c3801388d0] asm_exc_page_fault at ffffffff9be00bc7
    [exception RIP: unknown or invalid address]
    RIP: ffff9441c9734458  RSP: ffffb3c380138980  RFLAGS: 00010282
    RAX: ffff9441c9734458  RBX: ffff9441c9734400  RCX: 0000000000000000
    RDX: 0000000000000000  RSI: 0000000000000000  RDI: ffff9441c9734400
    RBP: ffffb3c380138990   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000000  R12: ffff9441c9734400
    R13: 00000000000005dc  R14: ffff9441c49dda00  R15: ffffffff9e55ec40
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
#11 [ffffb3c380138980] skb_release_head_state at ffffffff9ba16117
#12 [ffffb3c380138998] consume_skb at ffffffff9ba18c13
#13 [ffffb3c3801389b0] tcp_mtu_probe at ffffffff9bb26405
#14 [ffffb3c380138a00] tcp_write_xmit at ffffffff9bb269f9
#15 [ffffb3c380138a68] __tcp_push_pending_frames at ffffffff9bb26f77
#16 [ffffb3c380138a88] tcp_rcv_established at ffffffff9bb1edf4
#17 [ffffb3c380138ad8] tcp_v4_do_rcv at ffffffff9bb30169
#18 [ffffb3c380138b00] tcp_v4_rcv at ffffffff9bb32482
#19 [ffffb3c380138b80] ip_protocol_deliver_rcu at ffffffff9baf424c
#20 [ffffb3c380138bb8] ip_local_deliver_finish at ffffffff9baf44a7
#21 [ffffb3c380138bd8] ip_local_deliver at ffffffff9baf454e
#22 [ffffb3c380138c38] ip_sublist_rcv_finish at ffffffff9baf467f
#23 [ffffb3c380138c58] ip_sublist_rcv at ffffffff9baf4811
#24 [ffffb3c380138ce0] ip_list_rcv at ffffffff9baf4c62
#25 [ffffb3c380138d48] __netif_receive_skb_list_core at ffffffff9ba3d12d
#26 [ffffb3c380138dc8] netif_receive_skb_list_internal at ffffffff9ba3d763
#27 [ffffb3c380138e38] napi_complete_done at ffffffff9ba3df24
#28 [ffffb3c380138e68] bnx2_poll_msix at ffffffffc02cb121 [bnx2]
#29 [ffffb3c380138ea0] __napi_poll at ffffffff9ba3e0b3
#30 [ffffb3c380138ed8] net_rx_action at ffffffff9ba3e631
#31 [ffffb3c380138f60] __do_softirq at ffffffff9bd5a349
#32 [ffffb3c380138fd0] __irq_exit_rcu at ffffffff9acff925
#33 [ffffb3c380138fe0] irq_exit_rcu at ffffffff9acffc7e
#34 [ffffb3c380138ff0] common_interrupt at ffffffff9bd3d724
--- <IRQ stack> ---
#35 [ffffb3c3800cbd68] asm_common_interrupt at ffffffff9be00e27
    [exception RIP: cpuidle_enter_state+218]
    RIP: ffffffff9bd4239a  RSP: ffffb3c3800cbe18  RFLAGS: 00000246
    RAX: 0000000000000000  RBX: ffff9442f7c7ec00  RCX: 0000000000000000
    RDX: 0000000000000001  RSI: 0000000000000000  RDI: 0000000000000000
    RBP: ffffb3c3800cbe68   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000000  R12: ffffffff9d0d24a0
    R13: 0000000000000003  R14: 0000000000000003  R15: 00000afefc75867b
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
#36 [ffffb3c3800cbe70] cpuidle_enter at ffffffff9b9ac56e
#37 [ffffb3c3800cbe98] call_cpuidle at ffffffff9ad68843
#38 [ffffb3c3800cbea8] cpuidle_idle_call at ffffffff9ad6e0fd
#39 [ffffb3c3800cbee8] do_idle at ffffffff9ad6e202
#40 [ffffb3c3800cbf08] cpu_startup_entry at ffffffff9ad6e48d
#41 [ffffb3c3800cbf20] start_secondary at ffffffff9ac9e6c9
#42 [ffffb3c3800cbf50] secondary_startup_64_no_verify at ffffffff9ac00263
crash> dis -rl 0xffff9441c9734458
dis: WARNING: ffff9441c9734458: no associated kernel symbol found
   0xffff9441c9734458:  add    %al,(%rax)

crash> kmem 0xffff9441c9734458
CACHE             OBJSIZE  ALLOCATED     TOTAL  SLABS  SSIZE  NAME
ffff9441c0e46c00      512       1114      1504     94     8k  skbuff_fclone_cache
  SLAB              MEMORY            NODE  TOTAL  ALLOCATED  FREE
  ffffdf6ec425cd00  ffff9441c9734000     0     16         15     1
  FREE / [ALLOCATED]
  [ffff9441c9734400]

      PAGE        PHYSICAL      MAPPING       INDEX CNT FLAGS
ffffdf6ec425cd00 109734000 dead000000000004        0  1 17ffffc0010200 slab,head
crash>
----------------------

Comment 3 CM76 2023-10-21 20:38:52 UTC

Created attachment 305276
dmesg.202310180543

Comment 4 CM76 2023-10-22 16:28:27 UTC

Probably has nothing to do with the Broadcom bnx2 driver. The server crashed with "tx-nocache-copy" set to off. 

I added the dmesg as attachment, the backtrace and kmem of the RIP address are below.

I ran qbittorrent on a different server with the same hardware config back in June this year. That server was running mainline/stable kernel version 6.3.x and then 6.4.0, and it never rebooted once.

----------------------------
crash> bt
PID: 0        TASK: ffff90b60095b300  CPU: 1    COMMAND: "swapper/1"
 #0 [ffffb9fdc0138610] machine_kexec at ffffffffa48afa3b
 #1 [ffffb9fdc0138670] __crash_kexec at ffffffffa4a133f3
 #2 [ffffb9fdc0138738] crash_kexec at ffffffffa4a14de2
 #3 [ffffb9fdc0138748] oops_end at ffffffffa4852131
 #4 [ffffb9fdc0138770] page_fault_oops at ffffffffa48c77b0
 #5 [ffffb9fdc01387d0] kernelmode_fixup_or_oops at ffffffffa48c7962
 #6 [ffffb9fdc0138810] __bad_area_nosemaphore at ffffffffa48c7ba5
 #7 [ffffb9fdc0138868] bad_area_nosemaphore at ffffffffa48c7ce6
 #8 [ffffb9fdc0138878] do_kern_addr_fault at ffffffffa48c7d8b
 #9 [ffffb9fdc01388a0] exc_page_fault at ffffffffa5941864
#10 [ffffb9fdc01388d0] asm_exc_page_fault at ffffffffa5a00bc7
    [exception RIP: unknown or invalid address]
    RIP: ffff90b602a3ca58  RSP: ffffb9fdc0138980  RFLAGS: 00010282
    RAX: ffff90b602a3ca58  RBX: ffff90b602a3ca00  RCX: 0000000000000000
    RDX: 0000000000000000  RSI: 0000000000000000  RDI: ffff90b602a3ca00
    RBP: ffffb9fdc0138990   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000000  R12: ffff90b602a3ca00
    R13: 00000000000005c8  R14: ffff90b6035f4800  R15: ffffffffa815ec40
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
#11 [ffffb9fdc0138980] skb_release_head_state at ffffffffa5616117
#12 [ffffb9fdc0138998] consume_skb at ffffffffa5618c13
#13 [ffffb9fdc01389b0] tcp_mtu_probe at ffffffffa5726405
#14 [ffffb9fdc0138a00] tcp_write_xmit at ffffffffa57269f9
#15 [ffffb9fdc0138a68] __tcp_push_pending_frames at ffffffffa5726f77
#16 [ffffb9fdc0138a88] tcp_rcv_established at ffffffffa571edf4
#17 [ffffb9fdc0138ad8] tcp_v4_do_rcv at ffffffffa5730169
#18 [ffffb9fdc0138b00] tcp_v4_rcv at ffffffffa5732482
#19 [ffffb9fdc0138b80] ip_protocol_deliver_rcu at ffffffffa56f424c
#20 [ffffb9fdc0138bb8] ip_local_deliver_finish at ffffffffa56f44a7
#21 [ffffb9fdc0138bd8] ip_local_deliver at ffffffffa56f454e
#22 [ffffb9fdc0138c38] ip_sublist_rcv_finish at ffffffffa56f467f
#23 [ffffb9fdc0138c58] ip_sublist_rcv at ffffffffa56f4811
#24 [ffffb9fdc0138ce0] ip_list_rcv at ffffffffa56f4c62
#25 [ffffb9fdc0138d48] __netif_receive_skb_list_core at ffffffffa563d12d
#26 [ffffb9fdc0138dc8] netif_receive_skb_list_internal at ffffffffa563d763
#27 [ffffb9fdc0138e38] napi_complete_done at ffffffffa563df24
#28 [ffffb9fdc0138e68] bnx2_poll_msix at ffffffffc056e121 [bnx2]
#29 [ffffb9fdc0138ea0] __napi_poll at ffffffffa563e0b3
#30 [ffffb9fdc0138ed8] net_rx_action at ffffffffa563e631
#31 [ffffb9fdc0138f60] __do_softirq at ffffffffa595a349
#32 [ffffb9fdc0138fd0] __irq_exit_rcu at ffffffffa48ff925
#33 [ffffb9fdc0138fe0] irq_exit_rcu at ffffffffa48ffc7e
#34 [ffffb9fdc0138ff0] common_interrupt at ffffffffa593d724
--- <IRQ stack> ---
#35 [ffffb9fdc00cbd68] asm_common_interrupt at ffffffffa5a00e27
    [exception RIP: cpuidle_enter_state+218]
    RIP: ffffffffa594239a  RSP: ffffb9fdc00cbe18  RFLAGS: 00000246
    RAX: 0000000000000000  RBX: ffff90b737c7ec00  RCX: 0000000000000000
    RDX: 0000000000000001  RSI: 0000000000000000  RDI: 0000000000000000
    RBP: ffffb9fdc00cbe68   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000000  R12: ffffffffa6cd24a0
    R13: 0000000000000004  R14: 0000000000000004  R15: 0000509709bfcf38
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
#36 [ffffb9fdc00cbe70] cpuidle_enter at ffffffffa55ac56e
#37 [ffffb9fdc00cbe98] call_cpuidle at ffffffffa4968843
#38 [ffffb9fdc00cbea8] cpuidle_idle_call at ffffffffa496e0fd
#39 [ffffb9fdc00cbee8] do_idle at ffffffffa496e202
#40 [ffffb9fdc00cbf08] cpu_startup_entry at ffffffffa496e48d
#41 [ffffb9fdc00cbf20] start_secondary at ffffffffa489e6c9
#42 [ffffb9fdc00cbf50] secondary_startup_64_no_verify at ffffffffa4800263

crash> kmem ffff90b602a3ca58
CACHE             OBJSIZE  ALLOCATED     TOTAL  SLABS  SSIZE  NAME
ffff90b600e46d00      512        771       864     54     8k  skbuff_fclone_cache
  SLAB              MEMORY            NODE  TOTAL  ALLOCATED  FREE
  fffffac1440a8f00  ffff90b602a3c000     0     16         12     4
  FREE / [ALLOCATED]
  [ffff90b602a3ca00]

      PAGE        PHYSICAL      MAPPING       INDEX CNT FLAGS
fffffac1440a8f00 102a3c000 dead000000000001        0  1 17ffffc0010200 slab,head
crash>
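
The two dumps (comment 2 and this one) can be compared with a little address arithmetic. The sketch below is an illustration only: the `locate` helper is hypothetical, and it assumes objects are laid out back to back from the slab's MEMORY address with a stride equal to the OBJSIZE that `kmem` reports (512), which may not hold on kernels built with slab debugging.

```python
SLAB_OBJECT_SIZE = 512  # OBJSIZE reported by kmem for skbuff_fclone_cache

# Faulting RIP, allocated object address and slab MEMORY address,
# copied from the two crash sessions in this report.
crashes = {
    "comment 2": {
        "rip": 0xFFFF9441C9734458,
        "object": 0xFFFF9441C9734400,
        "slab": 0xFFFF9441C9734000,
    },
    "comment 4": {
        "rip": 0xFFFF90B602A3CA58,
        "object": 0xFFFF90B602A3CA00,
        "slab": 0xFFFF90B602A3C000,
    },
}


def locate(rip, obj, slab, objsize=SLAB_OBJECT_SIZE):
    """Return (offset of RIP inside the object, object index in slab, remainder)."""
    offset_in_object = rip - obj
    object_index, remainder = divmod(obj - slab, objsize)
    return offset_in_object, object_index, remainder


for name, c in crashes.items():
    off, idx, rem = locate(c["rip"], c["object"], c["slab"])
    print(f"{name}: RIP = object+{off:#x}, object index {idx}, remainder {rem}")
```

In both dumps the faulting RIP is 0x58 bytes into an allocated skbuff_fclone_cache object, RAX holds the same value as RIP, and RDI, RBX and R12 hold the object's base address. That pattern is consistent with an indirect call through a pointer that held a data address instead of a function address, although the dumps alone do not show which structure field was involved.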

Comment 5 CM76 2023-10-22 16:38:47 UTC

Created attachment 305278
dmesg.202310221752

Comment 6 Bagas Sanjaya 2023-10-24 00:25:56 UTC

(In reply to CM76 from comment #0)
> I believe this is also an issue with the Broadcom bnx2 drivers since it only
> seems to happen when I enable "tx-nocache-copy" in ethtool.
> 
> The issue started when I was running Mainline/stable Kernel v6.5.x on
> another machine, after googling a bit I landed on an article from Red Hat
> that pointed at the possibility of an issue caused by failing hardware. I
> was renting the server, so I didn't bother to file a bug report and assumed
> it was the server that was going bad. But then it happened again on my other
> server as soon as I switched the bittorrent client to the same I was using
> on that other server. I turned "tx-nocache-copy" off and ran mainline kernel
> v6.5 (on Ubuntu 23.04) for a day or two without issue. After that I switched
> the kernel back to Ubuntu's kernel (v6.2) and the server ran for a couple
> more days without issue. Two days ago I turned "tx-nocache-copy" on again
> out of curiosity (kernel v6.2), and the server didn't run into any issue
> with this setting set to on. This morning I upgraded to Ubuntu 23.10 that
> runs their version of Kernel v6.5. The kernel panicked and server rebooted a
> couple of hours later. 
> 

Please perform bisection (see Documentation/admin-guide/bug-bisect.rst
in the kernel sources for how). Also, please test latest mainline
(currently v6.6-rc7).

Comment 7 CM76 2023-10-24 08:05:49 UTC

I reinstalled/re-provisioned my main server to go back to Ubuntu 23.04 (kernel v6.2.x) two days ago. I'll keep it running with its v6.2.x kernel for a couple more days before I try 6.6-RC and git bisect <v6.5.x.

Comment 8 CM76 2023-10-25 11:37:03 UTC

Thanks for your time. I've been using 6.5.8 since yesterday and it hasn't crashed despite everything I threw at the server/bittorrent client. 

I wish I had known about "decode_stacktrace.sh" earlier; I wouldn't have overlooked the sparsely detailed "net:" patch when I got back to filing the bug report on Saturday. I started on the 18th but gave up because I couldn't load the crash dump from that day in the crash utility.
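
For context, the raw Call Trace entries in the oops have the form `symbol+0xoffset/0xsize`, optionally prefixed with `?` (an entry the unwinder is not sure about) and followed by `[module]`. The sketch below only splits such lines into fields; it does not resolve source lines the way decode_stacktrace.sh does, and the regular expression and the `parse_frame` name are assumptions made for this example.

```python
import re

# One Call Trace line: "[time]  [? ]symbol+0xoff/0xsize [module]"
FRAME_RE = re.compile(
    r"^\[\s*(?P<time>\d+\.\d+)\]\s+"
    r"(?P<unreliable>\?\s+)?"
    r"(?P<symbol>[A-Za-z_][\w.]*)\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
    r"(?:\s+\[(?P<module>[\w-]+)\])?\s*$"
)


def parse_frame(line):
    """Parse one Call Trace line; return None for lines that are not frames."""
    m = FRAME_RE.match(line)
    if m is None:
        return None
    return {
        "symbol": m["symbol"],
        "offset": int(m["offset"], 16),   # offset of the return address in the function
        "size": int(m["size"], 16),       # total size of the function
        "module": m["module"],            # None for built-in code
        "reliable": m["unreliable"] is None,
    }


sample = [
    "[12090.273952]  ? skb_release_head_state+0x27/0xb0",
    "[12090.273964]  consume_skb+0x33/0xf0",
    "[12090.273973]  tcp_mtu_probe+0x565/0x5d0",
    "[12090.276218]  bnx2_poll_msix+0xa1/0xe0 [bnx2]",
    "[12090.273845]  <IRQ>",
]
for line in sample:
    print(parse_frame(line))
```

The symbol and offset pairs are what a decoder needs, together with a vmlinux that has debug information, to produce file and line annotations like the ones in the excerpt below.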

[...]
[88609.634236] ? asm_exc_page_fault (/build/linux-D15vQj/linux-6.5.0/arch/x86/include/asm/idtentry.h:570) 
[88609.634249] ? skb_release_head_state (/build/linux-D15vQj/linux-6.5.0/include/linux/skbuff.h:4572 /build/linux-D15vQj/linux-6.5.0/net/core/skbuff.c:997) 
[88609.634260] consume_skb (/build/linux-D15vQj/linux-6.5.0/net/core/skbuff.c:1007 (discriminator 1) /build/linux-D15vQj/linux-6.5.0/net/core/skbuff.c:1022 (discriminator 1) /build/linux-D15vQj/linux-6.5.0/net/core/skbuff.c:1238 (discriminator 1) /build/linux-D15vQj/linux-6.5.0/net/core/skbuff.c:1232 (discriminator 1)) 
[88609.634269] tcp_mtu_probe (/build/linux-D15vQj/linux-6.5.0/net/ipv4/tcp_output.c:2446) 
[...]
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/net/ipv4/tcp_output.c?h=v6.5.9#n2446

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/net?h=v6.5.9&id=e8dc72cb8312c1175a832b2e69239a23e8f7d570


Thanks again.
