Bug 217884 - Random kernel panic since > 6.3.7
Summary: Random kernel panic since > 6.3.7
Status: NEW
Alias: None
Product: Linux
Classification: Unclassified
Component: Kernel
Hardware: All Linux
Importance: P3 normal
Assignee: Virtual assignee for kernel bugs
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-09-07 18:26 UTC by cyayon
Modified: 2023-09-16 04:46 UTC

See Also:
Kernel Version: 6.3.9
Subsystem:
Regression: Yes
Bisected commit-id:



Description cyayon 2023-09-07 18:26:50 UTC
Hi,

I have been getting random kernel panics on my Arch Linux server since upgrading past kernel 6.3.7.
It is a headless server with no X11 and no graphics card, and it has an Intel processor (not AMD).

On 6.3.7 everything was fine. Since upgrading to 6.3.9, then 6.4.3, 6.4.4, and now 6.4.12, I get random crashes (sometimes after 24h or 48h, sometimes a little longer).

Nothing is logged in journald/syslog. I managed to capture only these three lines (by mounting /var/log/journal outside my NVMe boot disk):

BUG: unable to handle page fault for address: 00000000352aa941  
#PF: supervisor write access in kernel mode  
#PF: error_code(0x0002) - not-present page
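
For readers of these three lines: the error code is a bit field. Below is a minimal sketch that decodes its three lowest bits, following the x86 page-fault error code layout (bit 0 set = protection violation rather than not-present page, bit 1 set = write, bit 2 set = user mode). The function name `decode_pf_error` is made up for illustration, and higher bits (such as the instruction-fetch bit) are ignored.

```shell
# Decode the three lowest bits of an x86 page-fault error code.
# decode_pf_error is a hypothetical helper, not a kernel tool.
decode_pf_error() {
    code=$(( $1 ))   # accepts decimal or hex, e.g. 0x0002

    # Bit 0: 0 = page not present, 1 = protection violation
    if [ $(( code & 1 )) -ne 0 ]; then cause="protection violation"; else cause="not-present page"; fi
    # Bit 1: 0 = read access, 1 = write access
    if [ $(( code & 2 )) -ne 0 ]; then access="write"; else access="read"; fi
    # Bit 2: 0 = supervisor (kernel) mode, 1 = user mode
    if [ $(( code & 4 )) -ne 0 ]; then mode="user"; else mode="supervisor"; fi

    echo "$mode $access, $cause"
}

decode_pf_error 0x0002   # prints: supervisor write, not-present page
```

This matches the log above: kernel code tried to write to an address (00000000352aa941) that has no page mapped.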

Some information :  
uname -a : Linux xxxxorg 6.4.12-arch1-1 #1 SMP PREEMPT_DYNAMIC Thu, 24 Aug 2023 00:38:14 +0000 x86_64 GNU/Linux  
dmesg : http://ix.io/4FE8 
journalctl -b -1 | tail -1000 : http://ix.io/4FE9 (only last 1000 lines, too big)  
lspci -vvv : http://ix.io/4FEa  
lsmod  : http://ix.io/4FEf
lscpu : http://ix.io/4FEh

Of course, if I roll back to 6.3.7, the crashes stop.

Could you please help me debug this?

Thanks.
Comment 1 Artem S. Tashkinov 2023-09-07 21:33:04 UTC
There are not that many commits between 6.3.7 and 6.3.9, so your best option is to bisect the regression as described in: https://docs.kernel.org/admin-guide/bug-bisect.html

It may take quite some time but it's the best shot you've got since your issue is not widespread.
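
To get a feel for "quite some time": each bisection step halves the remaining range, so the number of kernels to build and test grows only with the logarithm of the commit count. Here is a rough sketch of that arithmetic. The helper name `bisect_steps` is hypothetical, and the result is an estimate for a mostly linear history, not what git will report exactly.

```shell
# Estimate the number of bisection steps for N candidate commits:
# roughly ceil(log2(N)), since every step halves the range.
# bisect_steps is a hypothetical helper for illustration.
bisect_steps() {
    commits=$1
    steps=0
    span=1
    # Double the covered span until it reaches the commit count.
    while [ "$span" -lt "$commits" ]; do
        span=$((span * 2))
        steps=$((steps + 1))
    done
    echo "$steps"
}

bisect_steps 1000   # prints: 10
```

In a clone of the stable tree (an assumption; any tree that has both tags works), the commit count could come from `git rev-list --count v6.3.7..v6.3.9`. With crashes that take 24h to 48h to appear, every step marked "good" needs at least that much uptime, which is where most of the time goes.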
Comment 2 Bagas Sanjaya 2023-09-08 07:49:28 UTC
(In reply to cyayon from comment #0)

Please test latest mainline first.
Comment 3 cyayon 2023-09-08 07:51:25 UTC
Hello,

Is 6.4.12 not mainline?

Thanks.
Comment 4 Artem S. Tashkinov 2023-09-08 09:41:48 UTC
6.4.15 is.
Comment 5 cyayon 2023-09-08 10:33:51 UTC
Sorry, I understand now.
6.4.12 is the latest Arch Linux 6.4 kernel.
I have to wait for 6.5.x…
Comment 6 Bagas Sanjaya 2023-09-08 14:13:34 UTC
(In reply to Artem S. Tashkinov from comment #4)
> 6.4.15 is.

Nope. I mean v6.x and v6.x-rcy (as in Linus's tree), not v6.x.y as in
the stable one.
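
The naming scheme can be sketched as a small classifier. This is only an illustration of the version patterns named above. The function name `classify_kernel_version` is made up, and distro suffixes such as `-arch1-1` are not treated specially.

```shell
# Classify a kernel version string by its shape:
#   v6.x-rcN -> mainline release candidate
#   v6.x     -> mainline release
#   v6.x.y   -> stable
# classify_kernel_version is a hypothetical helper for illustration.
classify_kernel_version() {
    v=${1#v}   # drop an optional leading "v"
    case "$v" in
        *-rc*) echo "mainline release candidate" ;;
        *.*.*) echo "stable" ;;
        *.*)   echo "mainline release" ;;
        *)     echo "unknown" ;;
    esac
}

classify_kernel_version v6.6-rc1   # prints: mainline release candidate
classify_kernel_version 6.4.15     # prints: stable
```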
Comment 7 cyayon 2023-09-10 06:47:21 UTC
Hi, 

It crashed again last night (kernel 6.4.12). I managed to get some logs from syslog-ng (nothing from journald).

Here are the last logs just before the crash.

http://ix.io/4FTW

It seems to be related to nft.

thanks.
Comment 8 Bagas Sanjaya 2023-09-11 12:49:59 UTC
(In reply to cyayon from comment #7)
> Hi, 
> 
> This night crash again (kernel 6.4.12). I manage to got some logs from
> syslog-ng (no log from journald).
> 
> Here are the last logs just before the crash.
> 
> http://ix.io/4FTW
> 
> It seems to be related to nft.
> 
> thanks.

OK, then perform bisection to find the culprit commit that introduces your
regression. If you don't know how to do so, see kernel documentation [1].

[1]: https://docs.kernel.org/admin-guide/bug-bisect.html
Comment 9 Bagas Sanjaya 2023-09-11 12:53:39 UTC
(In reply to cyayon from comment #7)
> Hi, 
> 
> This night crash again (kernel 6.4.12). I manage to got some logs from
> syslog-ng (no log from journald).
> 
> Here are the last logs just before the crash.
> 
> http://ix.io/4FTW
> 
> It seems to be related to nft.
> 
> thanks.

Now that v6.6-rc1 has been released, please test it. Since you're about to compile
a vanilla kernel, see the ArchWiki [1] for instructions.

[1]: https://wiki.archlinux.org/title/Kernel/Traditional_compilation
Comment 10 cyayon 2023-09-11 12:54:45 UTC
Hi,

Yesterday I reported the issue to the netfilter developers (via email).

Pablo N. told me the issue comes from commit bdace3b1a51887211d3e49417a18fdbd315a313b.

He also asked me to test 6.4.15 rather than 6.5.2, which is a little behind for this issue.

I don't know about 6.6rc1 vs 6.4.15.

I am currently testing and will keep you informed here.

Thanks
Comment 11 Bagas Sanjaya 2023-09-11 13:00:37 UTC
On 11/09/2023 19:54, bugzilla-daemon@kernel.org wrote:
> I don't know about 6.6rc1 vs 6.4.15.
> 

The former is a release candidate, tagged from Linus's tree (aka
mainline). It is primarily used for testing before the official release is made.
The latter is a stable kernel with fixes backported from mainline. It is
the recommended kernel to run in production. For more information, see [1].

Thanks.

[1]: https://kernel.org/category/releases.html
Comment 12 cyayon 2023-09-11 13:05:05 UTC
I meant that I didn't know whether 6.6rc1 includes the revert bdace3b1a51887211d3e49417a18fdbd315a313b (like 6.4.15).
Comment 13 Bagas Sanjaya 2023-09-11 13:09:31 UTC
On 11/09/2023 20:05, bugzilla-daemon@kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=217884
> 
> --- Comment #12 from cyayon@nbux.org ---
> I would like to say that I didn't know if 6.6rc1 include revert
> bdace3b1a51887211d3e49417a18fdbd315a313b (like 6.4.15).
> 

It should already include 26b5a5712eb85e ("netfilter: nf_tables: add
NFT_TRANS_PREPARE_ERROR to deal with bound set/chain") as the fix.
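
Whether a given mainline tag contains a commit can also be checked locally. Below is a minimal sketch, assuming a clone of Linus's tree. It uses `git merge-base --is-ancestor`, which exits 0 when the first commit is an ancestor of the second. The wrapper name `release_contains` is hypothetical.

```shell
# Report whether <tag> contains <commit>.
# release_contains is a hypothetical wrapper for illustration.
# Note: "does not contain" is also printed when git cannot resolve
# either name, so check stderr if the answer is surprising.
release_contains() {
    commit=$1
    tag=$2
    if git merge-base --is-ancestor "$commit" "$tag"; then
        echo "$tag contains $commit"
    else
        echo "$tag does not contain $commit"
        return 1
    fi
}

# Example, inside a clone of Linus's tree (not run here):
#   release_contains 26b5a5712eb85e v6.6-rc1
```

This only answers the question for mainline tags. Stable releases such as 6.4.15 carry backports under new hashes, so in a stable tree one would instead search the commit messages for the upstream hash, for example with `git log --oneline --grep=26b5a5712eb85e v6.4.15`.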
Comment 14 cyayon 2023-09-11 13:10:52 UTC
thanks
Comment 15 cyayon 2023-09-15 19:15:14 UTC
Hi,
No crash for 3 days with 6.4.15.
I will wait a few more days, but it should be OK. Many thanks!
I asked Pablo N. whether the patch / revert has been merged into 6.5.3 and am waiting for his answer…
Comment 16 cyayon 2023-09-16 04:46:18 UTC
Oh, no...
It crashed again this morning :(.

Here is the journald log: http://ix.io/4Gvd (6.4.15)
