Bug 202859 - Corruption when reading from disk with 32-core processor (megaraid_sas)
Status: RESOLVED PATCH_ALREADY_AVAILABLE
Alias: None
Product: SCSI Drivers
Classification: Unclassified
Component: Other
Hardware: x86-64 Linux
Importance: P1 normal
Assignee: scsi_drivers-other
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-03-11 02:56 UTC by Antti Tönkyrä
Modified: 2019-04-04 22:40 UTC
CC List: 2 users

See Also:
Kernel Version: 4.14.x, 4.15.x, 4.19.x
Subsystem:
Regression: No
Bisected commit-id:


Attachments
script I have used to trigger the problem, under default settings this script usually reproduces the problem within 2 hours (589 bytes, application/x-shellscript)
2019-03-11 02:56 UTC, Antti Tönkyrä
lscpu (1.65 KB, text/plain)
2019-03-11 14:55 UTC, Antti Tönkyrä
output of (cd /sys/kernel/debug/block/sdb && find . -type f -exec grep -aH . {} \;) (23.51 KB, text/plain)
2019-03-11 14:58 UTC, Antti Tönkyrä
output of (cd /sys/kernel/debug/block/sdb && find . -type f -exec grep -aH . {} \;) (16.45 KB, text/plain)
2019-03-11 14:59 UTC, Antti Tönkyrä
lscpu and debug-block-sdb output (17.00 KB, text/plain)
2019-03-12 08:07 UTC, Antti Tönkyrä

Description Antti Tönkyrä 2019-03-11 02:56:02 UTC
Created attachment 281693 [details]
script I have used to trigger the problem, under default settings this script usually reproduces the problem within 2 hours

I have been debugging a Dell R7415 with a 32-core AMD EPYC 7551P processor. The issue is that I get silent data corruption after a few hours of intensive disk I/O load. This has been verified on 2 different servers with the same components. I believe the problem could be that the PERC H330 Mini firmware is somehow faulty or that the megaraid_sas driver is broken.

At first I was running a HW-RAID setup on the server's H330 Mini controller and got the XFS filesystem corrupted more or less beyond repair. For subsequent testing I converted the disks to JBOD mode on the controller and made individual BTRFS filesystems on all disks, so that checksum errors would reveal any bad data they return. During testing I wrote a small script that usually reproduces the problem within 2 hours (attached as dell_fs_test.sh).
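For readers without access to the attachment, the general shape of such a stress loop can be sketched as below. This is a hypothetical reconstruction, not the attached dell_fs_test.sh: the write size, file names, and helper function are my assumptions; only the six read-write mount points come from the report.

```shell
#!/bin/sh
# Hypothetical sketch of a corruption-detecting stress loop: write random
# data, checksum it, re-read it from disk, and compare.

stress_round() {
    dir="$1"                                  # mount point of a filesystem under test
    f="$dir/stress.$$"
    # Write random data and force it to disk.
    dd if=/dev/urandom of="$f" bs=1M count=8 conv=fsync 2>/dev/null
    sum1=$(sha256sum "$f" 2>/dev/null | cut -d' ' -f1)
    # Drop the page cache so the re-read comes from the disk, not RAM
    # (needs root; silently skipped otherwise).
    sync
    echo 3 2>/dev/null > /proc/sys/vm/drop_caches || true
    sum2=$(sha256sum "$f" 2>/dev/null | cut -d' ' -f1)
    rm -f "$f"
    [ "$sum1" = "$sum2" ]                     # non-zero exit means corruption
}

# One round per data disk, in parallel (the report mounts these six read-write).
for m in /mnt/b /mnt/c /mnt/d /mnt/e /mnt/f /mnt/g; do
    stress_round "$m" || echo "checksum mismatch on $m" &
done
wait
```

In the real setup this would be looped for hours; with BTRFS underneath, the kernel additionally logs csum errors in dmesg whenever a read returns bad data.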

The problem starts with disks returning checksum errors (even disks that are not being written to). In my test setup I have 1 OS disk mounted read-only and 6 other disks mounted read-write (seen in the script). When the issue is triggered, the OS disk starts returning bad data as well. ramfs disks and live media don't seem to be affected, so it is only the disks behind the H330 Mini controller. After prolonged periods of checksum errors, I even managed to make the reported disk capacities jump between 512 bytes and their real sizes with a simple dd if=/dev/urandom of=/dev/sdX bs=1M count=100 (see below)

[26301.605563] sd 0:0:1:0: [sdb] Write cache: enabled, read cache: enabled, supports DPO and FUA
[26304.242496] sd 0:0:1:0: [sdb] Sector size 0 reported, assuming 512.
[26304.244081] sd 0:0:1:0: [sdb] 1 512-byte logical blocks: (512 B/512 B)
[26304.244083] sd 0:0:1:0: [sdb] 0-byte physical blocks
[26304.245853] sdb: detected capacity change from 480103981056 to 512
[26314.315108] sd 0:0:2:0: [sdc] Write cache: enabled, read cache: enabled, supports DPO and FUA
[26683.822304] sd 0:0:2:0: [sdc] Sector size 0 reported, assuming 512.
[26683.824020] sd 0:0:2:0: [sdc] 1 512-byte logical blocks: (512 B/512 B)
[26683.824022] sd 0:0:2:0: [sdc] 0-byte physical blocks
[26683.825751] sd 0:0:2:0: [sdc] Write cache: enabled, read cache: enabled, supports DPO and FUA
[26683.825754] sdc: detected capacity change from 480103981056 to 512
[26684.020835] sd 0:0:2:0: [sdc] Sector size 0 reported, assuming 512.
[26946.615148] sd 0:0:3:0: [sdd] Sector size 0 reported, assuming 512.
[26946.617214] sd 0:0:3:0: [sdd] 1 512-byte logical blocks: (512 B/512 B)
[26946.617216] sd 0:0:3:0: [sdd] 0-byte physical blocks
[26946.619055] sd 0:0:3:0: [sdd] Write cache: disabled, read cache: enabled, supports DPO and FUA
[26946.620292] sdd: detected capacity change from 4000787030016 to 512

I have managed to work around the problem by limiting the CPU to 24 cores (48 threads) in the BIOS, and haven't been able to reproduce any corruption with that limitation in place; but as soon as I switch back to the full 32c/64t configuration, the corruption starts happening again.

In an effort to investigate the issue further, I toyed around with megaraid_sas module parameters, and it seems that setting smp_affinity_enable=0 on the module stops the problem from happening, or at least makes it less likely. As I write this I have been running 4 hours of stress on the disks and haven't produced any corruption.
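For reference, a sketch of making that workaround persistent across boots. The smp_affinity_enable=0 parameter is taken from the report; the helper function and its path argument are my additions (on a real system the target file lives under /etc/modprobe.d/):

```shell
#!/bin/sh
# Hypothetical helper: write the megaraid_sas module option to a given file.
write_megaraid_opts() {
    echo 'options megaraid_sas smp_affinity_enable=0' > "$1"
}

# On a real machine (as root) the target would be, e.g.:
#   write_megaraid_opts /etc/modprobe.d/megaraid_sas.conf
# followed by a module reload (or an initramfs rebuild and a reboot)
# for the parameter to take effect:
#   modprobe -r megaraid_sas && modprobe megaraid_sas
```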

Oh, and the controller firmware log doesn't show any errors. The OS is silent too, unless a checksumming FS such as BTRFS is in use (or something like XFS metadata gets hosed).

Right now I'm at a loss on how to further debug the problem so here is my report. Feel free to ask for more details :)
Comment 1 Antti Tönkyrä 2019-03-11 03:16:07 UTC
Oh, and the script assumes that folders /mnt/{b,c,d,e,f,g} exist.
Comment 2 Ming Lei 2019-03-11 03:55:57 UTC
Can you reproduce this issue by running I/O on just a single LUN? For example, mount only /dev/sdb on /mnt/b.

Also, could you collect the following logs on 4.19.x kernel?

1) lscpu

2) (cd /sys/kernel/debug/block/sdb && find . -type f -exec grep -aH . {} \;)
Comment 3 Antti Tönkyrä 2019-03-11 14:55:59 UTC
Created attachment 281713 [details]
lscpu
Comment 4 Antti Tönkyrä 2019-03-11 14:58:35 UTC
Created attachment 281715 [details]
output of (cd /sys/kernel/debug/block/sdb && find . -type f -exec grep -aH . {} \;)

taken after checksum errors have started appearing
Comment 5 Antti Tönkyrä 2019-03-11 14:59:40 UTC
Created attachment 281717 [details]
output of (cd /sys/kernel/debug/block/sdb && find . -type f -exec grep -aH . {} \;)

taken before checksum errors have appeared on dmesg
Comment 6 Antti Tönkyrä 2019-03-11 15:12:54 UTC
After a 6-hour run, I was unable to trigger the issue on a single disk. I'll go for another run now, but I have attached the information requested above. The debugfs output was taken during a test on 6 disks.
Comment 7 Antti Tönkyrä 2019-03-11 20:45:32 UTC
Update: I did another run with a slightly lighter load on a single disk and reproduced the problem successfully. So it can be reproduced when reading/writing on a single disk (an SSD in this case).
Comment 8 Antti Tönkyrä 2019-03-12 02:40:35 UTC
With a single disk I was able to reproduce the problem with smp_affinity_enable=0 too.
Comment 9 Ming Lei 2019-03-12 02:48:50 UTC
From the debugfs log, the blk-mq mapping is a simple 1:N one: all 64 online CPUs (of 128 possible CPUs) are mapped to hw queue 0.

I don't understand why this issue doesn't happen after you reduce the CPU count
to 24 cores. Could you collect the following logs again after the CPU count is
reduced?

1) lscpu

2) (cd /sys/kernel/debug/block/sdb && find . -type f -exec grep -aH . {} \;)
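The mapping described above can also be read directly from those debugfs files. A minimal helper sketch: blk-mq's debugfs exposes one cpuN directory per software context under each hctxN directory, so listing them shows which CPUs feed which hardware queue (the function name is mine; debugfs must be mounted and readable):

```shell
#!/bin/sh
# Hypothetical helper: print which CPUs map to each blk-mq hardware queue,
# based on the per-hctx cpuN directories blk-mq exposes in debugfs.
map_hctx_cpus() {
    base="$1"   # e.g. /sys/kernel/debug/block/sdb
    for hctx in "$base"/hctx*; do
        [ -d "$hctx" ] || continue
        cpus=$(ls "$hctx" | sed -n 's/^cpu\([0-9][0-9]*\)$/\1/p' | sort -n | paste -sd, -)
        echo "${hctx##*/}: CPUs $cpus"
    done
}

# On the reporter's machine this would print one line per hw queue:
map_hctx_cpus /sys/kernel/debug/block/sdb
```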
Comment 10 Antti Tönkyrä 2019-03-12 08:07:13 UTC
Created attachment 281751 [details]
lscpu and debug-block-sdb output

Both outputs in this attachment.
Comment 11 Antti Tönkyrä 2019-04-04 22:40:03 UTC
Fixed by https://lkml.org/lkml/2019/3/28/202
