Bug 218601 - Regression - dd if=/dev/zero of=/zero causes shift-out-of-bounds && NULL pointer dereference, address: 0000000000000003
Status: NEW
Alias: None
Product: File System
Classification: Unclassified
Component: ext4
Hardware: Intel Linux
Importance: P3 normal
Assignee: fs_ext4@kernel-bugs.osdl.org
Reported: 2024-03-15 01:46 UTC by Colin
Modified: 2024-03-21 16:20 UTC

Regression: Yes
Bisected commit-id:


Attachments
Kernel output of relevant logging (18.16 KB, text/plain)
2024-03-15 01:46 UTC, Colin
kernel 6.8.1 dmesg of stress-ng (130.15 KB, text/plain)
2024-03-21 05:20 UTC, Colin

Description Colin 2024-03-15 01:46:37 UTC
Created attachment 305991 [details]
Kernel output of relevant logging

I have an Intel 13900K processor with 2 x 2TB Samsung 980 Pro SSDs. Running `dd if=/dev/zero of=/zero` causes NULL pointer dereferences and subsequently hard-locks my system after about 200GB has been written, as per the attached dmesg, though the amount written varies by roughly ±100GB between reproductions. I have been able to trigger this on both devices, but I am unable to trigger it on another mid-range PCIe 4.0 device I have.

I have been able to reproduce in fully updated installs of:

- Ubuntu 23.10 
- Debian 12.5
- Ubuntu 23.10 Live USB (writing with dd to the correct device, not the live USB root)
- Arch 2024.03.01 live USB

I have been unable to reproduce via a live USB install of Ubuntu 22.10, nor under Windows 11 with Git Bash, though I acknowledge the latter may be a flawed test case.

I also refer to https://bugzilla.kernel.org/show_bug.cgi?id=216422, which I suspect is related, although the distros listed above should have well and truly received those patches.

If patches are supplied as part of this bug ticket, could somebody kindly advise on how to apply and deploy the patches in an accepted manner so I don't deviate from expected test conditions? 

I'm new to kernel bug reporting, please let me know if I've missed anything.
Comment 1 Keith Busch 2024-03-15 02:07:38 UTC
Looks odd, but this doesn't appear related to nvme.
Comment 2 Colin 2024-03-15 04:32:00 UTC
I'm unsure what the correct topic would be. I'm happy to change it if you suggest a new one, or for you to change it if bugzilla allows you to. 

https://bbs.archlinux.org/viewtopic.php?pid=2102715 suggested ucsi_acpi, but I believe that's USB. I'm unsure whether this would be ACPI either; I did disable power management both in the BIOS and via the kernel parameter, to no avail.
Comment 3 Artem S. Tashkinov 2024-03-15 10:07:09 UTC
Did you actually bisect?

The commit you've provided doesn't look like it might have caused the issue:

https://github.com/torvalds/linux/commit/326e1c208f3f24d14b93f910b8ae32c94923d22c
Comment 4 Artem S. Tashkinov 2024-03-15 10:18:52 UTC
I'm not an expert, but your backtrace looks weird. Please run memtest86 or memtest86+ for an hour or two.

https://www.memtest86.com/download.htm

https://github.com/memtest86plus/memtest86plus/releases


I'm also removing the bisected ID because it doesn't look like it has anything to do with this issue.

You don't just copy random stuff from other bug reports which look similar to yours. You actually bisect and provide the full bisect history.

Ubuntu 22.10 comes with kernel 5.19.

Ubuntu 23.10 comes with kernel 6.5.

That's a very large regression window. And then you can only bisect on vanilla kernels, not Ubuntu ones. First compile and make sure 5.19 absolutely works for you. Then try consecutive kernels one by one. If you hit the issue, then you will have just two kernel releases to work with, instead of trying to find the regressions between two very distant kernels.

https://docs.kernel.org/admin-guide/bug-bisect.html

Best of luck.
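
To illustrate the narrowing-down step described above, here is a minimal Python sketch of a binary search over an ordered list of releases. It is not how `git bisect` is implemented, only the same idea in miniature. The function name `first_bad`, the release list, and the `is_bad` callback are all hypothetical; in practice "testing" a release means building and booting that vanilla kernel and trying to reproduce the crash. The sketch also assumes the first entry is known good, the last is known bad, and that a "good" result is trustworthy, which is a strong assumption for a crash that takes a varying amount of time to appear.

```python
from typing import Callable, Sequence


def first_bad(versions: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the earliest version for which is_bad() reports a failure.

    Assumes versions[0] is known good, versions[-1] is known bad, and that
    every version after the first bad one is also bad.
    """
    good, bad = 0, len(versions) - 1  # indices of known-good / known-bad
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(versions[mid]):
            bad = mid    # failure reproduced: culprit is at or before mid
        else:
            good = mid   # no failure: culprit is after mid
    return versions[bad]


if __name__ == "__main__":
    # Hypothetical release list spanning the window mentioned in this thread.
    releases = ["5.19", "6.0", "6.1", "6.2", "6.3", "6.4", "6.5"]

    def ask_user(version: str) -> bool:
        # Placeholder for "build, boot and test this kernel by hand".
        return input(f"Does {version} crash? [y/n] ").strip().lower() == "y"

    print("First bad release:", first_bad(releases, ask_user))
```

Once two adjacent releases are identified this way, `git bisect` between their tags does the same halving at commit granularity.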
Comment 5 Theodore Tso 2024-03-15 16:55:00 UTC
Also note that upstream Linux kernel developers do not provide free kernel support for Ubuntu (or any other distro) kernels. If you want support for your distro kernel, in the case of Ubuntu, you need to pay $$$ to Canonical. Otherwise, please try to replicate the problem on the *current* upstream kernel, and not an ancient kernel such as 5.19 or 6.5.
Comment 6 Colin 2024-03-19 06:22:42 UTC
Firstly, apologies for any incorrectness. The commit reference link did not accept inputs that were 'probably this sha' or similar. It was not meant in malice or laziness, but admittedly I was very tired as a result of the unpredictability of this issue.

I suspect this may actually be dead hardware, but it's hard to tell. If somebody is interested in exploring this issue further, feel free to provide instructions over the next few days; otherwise I'll buy a new motherboard, and RMA the processor if the motherboard does not fix the issue. I value both my time and all of yours, so buying new equipment in the hope that I'm rid of the problem is my preference unless there's some burning desire for further exploration.

Here are some facts:

- `stress-ng --class cpu --seq 32` reliably crashes the machine in less than 60 seconds with the error message 'stress-ng: fail:  [2701] af-alg: ctr(twofish): decrypted data different from original data (possible kernel bug)', and the same happens with other algorithms (pcbc(fcrypt), cbc(sm4), etc.). Note that `dd if=/dev/zero` is not cryptographic. I routinely checksum files with sha and do not notice any inconsistencies.
- I have tried vanilla kernels including 6.0.1 and 6.8. `dd` _seems_ to fail faster with more recent kernels, but maybe that's just due to a small test sample size.
- I have been unable to crash Windows using Prime95 (24h), Furmark (24h) or dd, nor am I aware of any errors from any of these tools. The Windows install routinely hit a BSOD, but this went away as soon as I switched to a different USB stick (both freshly flashed). It's possible this is related, but it could also be a bad USB stick.
- The motherboard is an ASUS Prime Z790-P WiFi D4 LGA 1700 and the processor a 13900K. I have been experiencing the issue for about 12 months, and it seems to have become more frequent lately.
- I cannot visually see any problems with the motherboard, no bloated capacitors as far as I can tell.
- I have replaced the RAM, PSU and SSDs (with a non-Samsung model) and removed all aux cards, with the exception of the onboard WiFi 6, which cannot be removed.
- Memtest86+ was run with my original 128G RAM configuration on 2 x 24h occasions and did not yield any errors, indicating that CPU<>memory integrity is not the issue.
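
The stress-ng failure above is a "compute the same thing twice and compare" check. A rough user-space analogue, sketched in Python under the assumption that a repeated SHA-256 over an unchanged buffer must always give the same digest on sound hardware, is shown here. The helper name `digests_are_stable`, the buffer size and the round count are arbitrary choices for illustration; this does not exercise the kernel's af-alg interface that stress-ng uses, so a clean result proves very little.

```python
import hashlib
import os


def digests_are_stable(data: bytes, rounds: int = 20) -> bool:
    """Hash the same buffer repeatedly and compare against the first digest.

    Returns False as soon as any round produces a different digest, which
    should never happen for deterministic code on healthy hardware.
    """
    reference = hashlib.sha256(data).digest()
    for _ in range(rounds):
        if hashlib.sha256(data).digest() != reference:
            return False
    return True


if __name__ == "__main__":
    sample = os.urandom(1 << 20)  # 1 MiB of random data; size chosen arbitrarily
    print("stable" if digests_are_stable(sample) else "MISMATCH")
```

A mismatch from a check like this would point towards CPU or memory trouble rather than a filesystem bug.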
Comment 7 Keith Busch 2024-03-19 15:08:03 UTC
(In reply to Colin from comment #6)
> I suspect this may actually be dead hardware but it's hard to tell.

I can't necessarily rule that out, however, based on the stack trace you attached, this looks like a software bug. If you want to go further, I recommend following Ted's suggestion and attempt to reproduce with a recent upstream kernel. The most recent stable version as I write this is 6.8.1.
Comment 8 Colin 2024-03-21 05:20:03 UTC
Attached is the dmesg output of 6.8.1 built manually from vanilla sources under Ubuntu 23.10 when running `stress-ng --class cpu --seq 32`. 

A few things I find odd: 

- If this was a bug, surely the hardware is not esoteric enough that nobody else would have experienced this? 
- Again, it feels like this has 'gotten worse' since I first experienced it, and older kernels seem more stable
- Building the linux kernel was segfaulting at different locations, I ended up building it on another machine.
Comment 9 Colin 2024-03-21 05:20:37 UTC
Created attachment 306016 [details]
kernel 6.8.1 dmesg of stress-ng
Comment 10 Christian Kujau 2024-03-21 15:07:31 UTC
> Building the linux kernel was segfaulting at different locations, 
> I ended up building it on another machine.

Not good. Maybe swap/remove some RAM modules. See also: https://bitwizard.nl/sig11/
Comment 11 Keith Busch 2024-03-21 16:20:55 UTC
(In reply to Colin from comment #9)
> Created attachment 306016 [details]
> kernel 6.8.1 dmesg of stress-ng

Doesn't look like the same failure as the first attachment. Maybe your hardware really is broken.
