Bug 202127

Summary: cannot mount or create xfs on a 597T device
Product: File System
Reporter: Manhong Dai (manhongdai)
Component: XFS
Assignee: FileSystem/XFS Default Virtual Assignee (filesystem_xfs)
Status: RESOLVED PATCH_ALREADY_AVAILABLE
Severity: normal
Priority: P1
CC: bugzilla, sandeen
Hardware: All
OS: Linux
Kernel Version: 4.20.0-arch1-1-ARCH #1 SMP PREEMPT Mon Dec 24 03:00:40 UTC 2018 x86_64 GNU/Linux
Subsystem:
Regression: No
Bisected commit-id:
Attachments: output of "mkfs -t xfs -f /dev/sda"

Description Manhong Dai 2019-01-03 19:54:01 UTC
Created attachment 280259 [details]
output of "mkfs -t xfs -f /dev/sda"

mkfs.xfs version 4.19.0

When mounting an XFS filesystem created under kernel 4.17.0, the error is:
# mount /dev/sda /treehouse/watsons_lab/s1/
mount: /treehouse/watsons_lab/s1: mount(2) system call failed: Structure needs cleaning.
#dmesg
[  397.225921] XFS (sda): SB stripe unit sanity check failed
[  397.226007] XFS (sda): Metadata corruption detected at xfs_sb_read_verify+0x106/0x180 [xfs], xfs_sb block 0xffffffffffffffff 
[  397.226090] XFS (sda): Unmount and run xfs_repair
[  397.226126] XFS (sda): First 128 bytes of corrupted metadata buffer:
[  397.226173] 00000000: 58 46 53 42 00 00 10 00 00 00 00 22 c9 7a 00 00  XFSB.......".z..
[  397.226228] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[  397.226283] 00000020: c3 44 e5 14 dd 09 45 69 b0 7d 3f 9b 4b 0a fa d9  .D....Ei.}?.K...
[  397.226337] 00000030: 00 00 00 11 60 00 00 08 00 00 00 00 00 00 08 00  ....`...........
[  397.226392] 00000040: 00 00 00 00 00 00 08 01 00 00 00 00 00 00 08 02  ................
[  397.226474] 00000050: 00 00 00 01 0f ff fe 00 00 00 02 2d 00 00 00 00  ...........-....
[  397.226530] 00000060: 00 07 f6 00 bd a5 10 00 02 00 00 08 00 00 00 00  ................
[  397.226585] 00000070: 00 00 00 00 00 00 00 00 0c 0c 09 03 1c 00 00 01  ................
[  397.226651] XFS (sda): SB validate failed with error -117.

#mkfs -t xfs -f /dev/sda &> mkfs.log  #the log file is attached.
Comment 1 Manhong Dai 2019-01-03 19:56:17 UTC
[root@watsons-s1 ~]# grep sda /proc/partitions 
   8        0 597636415488 sda


By the way, the xfs filesystem works without any issue under the environment below
[root@watsons-s1 ~]# mkfs.xfs -V
mkfs.xfs version 4.17.0
[root@watsons-s1 ~]# uname -a
Linux watsons-s1.mbni.org 4.17.5-1-ARCH #1 SMP PREEMPT Sun Jul 8 17:27:31 UTC 2018 x86_64 GNU/Linux
Comment 2 Eric Sandeen 2019-01-03 21:13:57 UTC
>          =                       sunit=256    swidth=64 blks

It's very strange to have automatically arrived at an impossible geometry with swidth < sunit.

Can you provide the output of:

# blockdev --getsz  --getsize64 --getss --getpbsz  --getiomin   --getioopt /dev/sda

please?
Comment 3 Manhong Dai 2019-01-03 22:18:56 UTC
[root@watsons-s1 ~]# blockdev --getsz  --getsize64 --getss --getpbsz  --getiomin   --getioopt /dev/sda
1195272830976
611979689459712
512
4096
1048576
262144
Comment 4 Dave Chinner 2019-01-03 22:51:13 UTC
On Thu, Jan 03, 2019 at 10:18:56PM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=202127
> 
> --- Comment #3 from daimh@umich.edu ---
> [root@watsons-s1 ~]# blockdev --getsz  --getsize64 --getss --getpbsz 
> --getiomin   --getioopt /dev/sda
> 1195272830976
> 611979689459712
> 512
> 4096
> 1048576
> 262144

Ok, so iomin=1MB and ioopt=256k. That's an invalid block device
configuration. If this is coming from hardware RAID (e.g. via a scsi
code page) then this is a firmware bug. If this is coming from a
software layer (e.g. lvm, md, etc) then it may be a bug in one of
those layers.

IOWs, we need to know what your storage stack configuration is and
what hardware underlies /dev/sda. Can you attach the storage stack
and hardware information indicated in this link for us?

http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F

Eric, we probably need to catch this in mkfs when validating blkid
information - the superblock verifier is too late to be catching
bad config info like this....

Cheers,

Dave.
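
For anyone hitting the same symptom, the invalid combination is easy to check from a shell before running mkfs at all. A minimal sketch, assuming the device is /dev/sda as above (an optimal IO size of zero means "not reported" and is ignored):

# Compare the minimum and optimal IO sizes reported by the device;
# a nonzero optimal size smaller than the minimum size is invalid.
dev=/dev/sda
iomin=$(blockdev --getiomin "$dev")
ioopt=$(blockdev --getioopt "$dev")
if [ "$ioopt" -ne 0 ] && [ "$ioopt" -lt "$iomin" ]; then
    echo "$dev: ioopt ($ioopt) < iomin ($iomin) -- invalid geometry"
fi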
Comment 5 Manhong Dai 2019-01-03 22:58:41 UTC
/dev/sda is a hardware RAID; lspci -vvv output is below. By the way, /dev/sda has been working great with kernel 4.17.5-1-ARCH and mkfs.xfs version 4.17.0. I can create, mount, and copy 200+ TB of data without any glitches. Further, if I downgrade the Linux kernel and xfsprogs, /dev/sda works.


02:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS-3 3316 [Intruder] (rev 01)
        Subsystem: LSI Logic / Symbios Logic MegaRAID SAS 9361-16i
        Physical Slot: 3
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 32 bytes
        Interrupt: pin A routed to IRQ 26
        NUMA node: 0
        Region 0: I/O ports at 6000 [size=256]
        Region 1: Memory at c7240000 (64-bit, non-prefetchable) [size=64K]
        Region 3: Memory at c7200000 (64-bit, non-prefetchable) [size=256K]
        Expansion ROM at c7100000 [disabled] [size=1M]
        Capabilities: [50] Power Management version 3
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [68] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 4096 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 0.000W
                DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
                        RlxdOrd- ExtTag+ PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 256 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr- TransPend-
                LnkCap: Port #0, Speed 8GT/s, Width x8, ASPM L0s, Exit Latency L0s <2us
                        ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 8GT/s (ok), Width x8 (ok)
                        TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range BC, TimeoutDis+, LTR-, OBFF Not Supported
                         AtomicOpsCap: 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
                         AtomicOpsCtl: ReqEn-
                LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
                         EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest+
        Capabilities: [d0] Vital Product Data
                Not readable
        Capabilities: [a8] MSI: Enable- Count=1/1 Maskable+ 64bit+
                Address: 0000000000000000  Data: 0000
                Masking: 00000000  Pending: 00000000
        Capabilities: [c0] MSI-X: Enable+ Count=97 Masked-
                Vector table: BAR=1 offset=0000e000
                PBA: BAR=1 offset=0000f000
        Capabilities: [100 v2] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
                AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
                        MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
                HeaderLog: 04000001 0018000f 02010000 89faae40
        Capabilities: [1e0 v1] Secondary PCI Express <?>
        Capabilities: [1c0 v1] Power Budgeting <?>
        Capabilities: [148 v1] Alternative Routing-ID Interpretation (ARI)
                ARICap: MFVC- ACS-, Next Function: 0
                ARICtl: MFVC- ACS-, Function Group: 0
        Kernel driver in use: megaraid_sas
        Kernel modules: megaraid_sas
Comment 6 Chris Murphy 2019-01-03 23:08:05 UTC
What do you get from 'MegaCli64 -AdpAllInfo -aAll'? Parse it for the firmware revision. The current version on the website is 24.22.0-0034, dated Oct 12 2018. Upgrading it carries some risk *shrug*; there might be a changelog you can download (now at Broadcom) to see if anything related to this problem has been fixed.
Comment 7 Eric Sandeen 2019-01-03 23:12:27 UTC
dchinner: Agreed, we need to validate this and refuse to proceed with nonsense

daimh: I'm curious, what geometry does mkfs.xfs (or xfs_info) give you when you use xfsprogs-4.17.0?
Comment 8 Chris Murphy 2019-01-03 23:14:43 UTC
Ok, it looks like firmware package 24.22.0-0034 translates to firmware 4.740.00-8394. I can't fully parse this file after giving it only a two-minute read, so I have no idea which firmware has this change, but I found this:

SCGCQ01150173 - (Port_Complete) - Request for NVdata change on 9361-8i to make the minimum stripe size 16K (instead of 64K)

But you have a 16i, not an 8i. OK, whatever. I would just flash it with the latest firmware if it doesn't already have it, and if it's still a problem, file a bug with Broadcom and reference this bug, specifically comment 4.
Comment 9 Manhong Dai 2019-01-03 23:17:26 UTC
[root@watsons-s1 ~]# /opt/MegaRAID/storcli/storcli /c0 show | head -n 30
Generating detailed summary of the adapter, it may take a while to complete.

Controller = 0
Status = Success
Description = None

Product Name = AVAGO MegaRAID SAS 9361-16i
Serial Number = SK82367083
SAS Address =  500062b201dbbcc0
PCI Address = 00:02:00:00
System Time = 01/03/2019 18:17:05
Mfg. Date = 06/06/18
Controller Time = 01/03/2019 23:17:05
FW Package Build = 24.19.0-0047
BIOS Version = 6.34.01.0_4.19.08.00_0x06160200
FW Version = 4.720.00-8218
Driver Name = megaraid_sas
Driver Version = 07.706.03.00-rc1
Current Personality = RAID-Mode
Vendor Id = 0x1000
Device Id = 0xCE
SubVendor Id = 0x1000
SubDevice Id = 0x9371
Host Interface = PCI-E
Device Interface = SAS-12G
Bus Number = 2
Device Number = 0
Function Number = 0
Drive Groups = 1
Comment 10 Manhong Dai 2019-01-03 23:20:40 UTC
I'm downgrading the server now and will get the xfs_info output in a minute. I don't want to upgrade the firmware, though, because it worked beautifully under the two old versions of Linux and xfsprogs, and I don't want to get fired. :)
Comment 11 Dave Chinner 2019-01-03 23:22:17 UTC
On Thu, Jan 03, 2019 at 10:58:41PM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=202127
> 
> --- Comment #5 from daimh@umich.edu ---
> /dev/sda is a hardware raid. lspci -vvv output is below. By the way, /dev/sda
> has been working great with kernel 4.17.5-1-ARCH and mkfs.xfs version 4.17.0.
> I
> can create, mount and copy 200+ T data without any glitches. Further, If I
> downgrade linux kernel and xfsprogs, /dev/sda can work.

Sure, that's because newer kernels and tools are much more stringent
about validity checking the on-disk information. And, in this case,
newer tools have found a validity problem that the older tools and
kernel didn't.

You can use xfs_db to fix the broken alignment in the superblock,
but I'd like to get to the bottom of where the problem is coming
from first.

> 02:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS-3 3316
> [Intruder] (rev 01)

Ok, so it's an lsi/broadcom 3316 hardware RAID controller. Which
means this is most likely a firmware bug. Can you update the raid
controller to the latest firmware and see if the block device still
reports the same iomin/ioopt parameters?

If so, you need to talk to your vendor about getting their hardware
bug fixed and ensure their QA deficiencies are addressed, then use
xfs_db to rewrite the stripe unit/stripe width to valid values so
you can continue to use the filesystem on modern kernels.

Cheers,

Dave.
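
The xfs_db surgery Dave mentions would look roughly like this; the values are only illustrative and must match the array's real geometry. In xfs_db, the superblock unit/width fields are counted in filesystem blocks, and the filesystem must be unmounted:

# Expert mode (-x) allows writes. With 4k blocks, a 1 MiB stripe unit
# is 256 blocks; 17 data disks would give width = 256 * 17 = 4352.
xfs_db -x -c "sb 0" -c "write unit 256" -c "write width 4352" /dev/sda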
Comment 12 Manhong Dai 2019-01-03 23:24:52 UTC
[root@watsons-s1 ~]# uname -a
Linux watsons-s1.mbni.org 4.17.5-1-ARCH #1 SMP PREEMPT Sun Jul 8 17:27:31 UTC 2018 x86_64 GNU/Linux
[root@watsons-s1 ~]# mkfs.xfs -V
mkfs.xfs version 4.17.0
[root@watsons-s1 ~]# mkfs -t xfs -f /dev/sda  
meta-data=/dev/sda               isize=512    agcount=557, agsize=268434944 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=0
data     =                       bsize=4096   blocks=149409103872, imaxpct=1
         =                       sunit=256    swidth=64 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=521728, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

I will think about the firmware upgrade tomorrow. :)
Comment 13 Eric Sandeen 2019-01-03 23:32:36 UTC
(In reply to daimh from comment #12)

> [root@watsons-s1 ~]# mkfs.xfs -V
> mkfs.xfs version 4.17.0
> [root@watsons-s1 ~]# mkfs -t xfs -f /dev/sda  

Thanks.  That's what I suspected, i.e. older xfsprogs just didn't validate or notice this problem, but I was too lazy to check the history. ;) Thanks for the test and the info.
Comment 14 Manhong Dai 2019-01-04 04:42:45 UTC
I just upgraded the firmware from 24.19.0-0047 to the latest 24.22.0-0034, and also upgraded the kernel to 4.20.0-arch1-1-ARCH with xfsprogs version 4.19.0, but the problem persists.

[root@watsons-s1 ~]# /opt/MegaRAID/storcli/storcli /c0 show all | grep -i firmware
Firmware Package Build = 24.22.0-0034
Firmware Version = 4.740.00-8394
Support PD Firmware Download = No

[root@watsons-s1 ~]# uname -a
Linux watsons-s1.mbni.org 4.20.0-arch1-1-ARCH #1 SMP PREEMPT Mon Dec 24 03:00:40 UTC 2018 x86_64 GNU/Linux

[root@watsons-s1 ~]# mkfs.xfs -V
mkfs.xfs version 4.19.0

[root@watsons-s1 ~]# blockdev --getsz  --getsize64 --getss --getpbsz  --getiomin   --getioopt /dev/sda
1195272830976
611979689459712
512
4096
1048576
262144
Comment 15 Manhong Dai 2019-01-04 21:59:11 UTC
I've had a few back-and-forths with Broadcom support. The engineer checked some logs generated by their standard error-check script, and the logs look fine to him.

He asked me to run mkfs.ext4 on the block device; it failed with the error message "mkfs.ext4: Size of device (0x22c97a0000 blocks) /dev/sda too big to create a filesystem using a blocksize of 4096."

Then he asked whether I could create XFS and ext4 filesystems on a smaller partition. I fdisked the block device into two smaller partitions, one 2T and the other 50T. mkfs.ext4 worked on both partitions, but mkfs.xfs still reports the same error.

Then I was told to contact you guys again. Here is his latest reply.

"I think it is a xfs issue here.  Can you go back to the developer ? What were
the changes from 4.17 to 4.19 ?

If the problem is caused by our firmware, ext4 would have failed."
Comment 16 Eric Sandeen 2019-01-04 22:02:58 UTC
Broadcom is simply wrong - ext4 doesn't care about or use the stripe geometry like xfs does.  But that is beside the point, because:

It is not valid to have a preferred I/O size smaller than the minimum I/O size - this should be obvious to the vendor.  We can detect this at mkfs time and error out or ignore the bad values, but there can be no debate about whether the hardware is returning nonsense.  It /is/ returning nonsense.  That's a firmware bug.

Pose the question to broadcom:

"How can the preferred IO size be less than the minimum allowable IO size?

Because that's what the hardware it telling us.

-Eric
Comment 17 Dave Chinner 2019-01-04 23:03:50 UTC
On Fri, Jan 04, 2019 at 10:02:58PM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=202127
> 
> --- Comment #16 from Eric Sandeen (sandeen@sandeen.net) ---
> Broadcom is simply wrong - ext4 doesn't care about or use the stripe geometry
> like xfs does.  But that is beside the point, because:
> 
> It is not valid to have a preferred I/O size smaller than the minimum I/O
> size
> - this should be obvious to the vendor.  We can detect this at mkfs time and
> error out or ignore the bad values, but there can be no debate about whether
> the hardware is returning nonsense.  It /is/ returning nonsense.  That's a
> firmware bug.
> 
> Pose the question to broadcom:
> 
> "How can the preferred IO size be less than the minimum allowable IO size?

Just to clarify, the "minio" being reported here is not the
"minimum allowable IO size". The minimum allowed IO size is the
logical sector/block size of the device.

"minimum_io_size" is badly named - it's actually the smallest IO
size alignment that allows for /efficient/ IO operations to be
performed by the device, and that's typically very different to
logical_block_size of the device. e.g:

$ cat /sys/block/sdf/queue/hw_sector_size
512
$ cat /sys/block/sdf/queue/logical_block_size
512
$ cat /sys/block/sdf/queue/physical_block_size
4096
$ cat /sys/block/sdf/queue/minimum_io_size
4096
$

So, we can do 512 byte sector IOs to this device, but it's not
efficient due to it having a physical 4k block size, i.e. it
requires an RMW cycle to do a 512 byte write.

IOWs, a 4k IO (minimum_io_size) will avoid physical block RMW cycles
as the physical block size of the storage is 4k. That's what
"minimum efficient IO size" means.

For a RAID5/6 lun, this is typically the chunk size, as
many RAID implementations can do single chunk aligned writes
efficiently via partial stripe recalculation without needing RMW
cycles. If the write partially overlaps chunks, then RMW cycles are
required for RAID recalc, hence setting the RAID chunk size as the
"minimum_io_size" makes sense.

However, a device may not be efficient and reach saturation when
fed lots of minimum_io_size requests. That's where optimal_io_size
comes in - a lot of SSDs out there have an optimal IO size in the
range of 128-256KB because they can't reach max throughput when
smaller IO sizes are used (iops bound), i.e. the optimal IO size is
the size of the IO that will allow the entire bandwidth of the
device to be effectively utilised.

For a RAID5/6 lun, the optimal IO size is the one that keeps all
disk heads moving sequentially and in synchronisation and doesn't
require partial stripe writes (and hence RMW cycles) to occur. IOWs,
its the IO alignment and size that will allow full stripe writes to
be sent to the underlying device.

By definition, the optimal_io_size is /always/ >= minimum_io_size.
If the optimal_io_size is < minimum_io_size, then one of them is
incorrectly specified. The only time this does not hold true is when
the device does not set an optimal_io_size, in which case it should
be zero and then gets ignored by userspace.

Regardless, what still stands here is that the firmware needs fixing
and that is only something broadcom can fix.

Cheers,

Dave.
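
As an aside, the sysfs values quoted above can be read for every layer of a stacked device at once, which helps pin down which layer introduced the bad geometry. For example, with util-linux:

# The MIN-IO and OPT-IO columns show minimum_io_size and
# optimal_io_size for each device in the stack (disk, partitions,
# md/lvm volumes).
lsblk --topology /dev/sda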
Comment 18 Eric Sandeen 2019-01-04 23:08:42 UTC
Thanks for the clarification, Dave, I shouldn't have made that mistake.

New question for broadcom:

"How can the preferred IO size be smaller than the minimum efficient IO size?"
Comment 19 Manhong Dai 2019-01-05 01:36:10 UTC
"I will check with engineering next week and get back to you."

This is the latest reply from Broadcom support. I will keep you guys updated. Have a great weekend!
Comment 20 eflorac 2019-01-08 17:06:52 UTC
On Fri, 04 Jan 2019 21:59:11 +0000,
bugzilla-daemon@bugzilla.kernel.org wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=202127
> 
> --- Comment #15 from daimh@umich.edu ---
> I have a few back and forth with Broadcom support. He checked some
> logs generated by their standard error check script and the logs
> looks fine to him.
> 
> He asked to run mkfs.ext4 on the block device, it failed with an
> error message "mkfs.ext4: Size of device (0x22c97a0000
> blocks) /dev/sda too big to create a filesystem using a blocksize of
> 4096."
> 
> Then he asked me if I can create XFS and EXT4 on a smaller partition.
> I fdisked the block device to two smaller partitions, one is 2T the
> other is 50T. mkfs.ext4 worked on both partition, but mkfs.xfs still
> reports the same error.
> 
> Then I was told to contact you guys again. Here is his latest reply.
> 
> "I think it is a xfs issue here.  Can you go back to the developer ?
> What were the changes from 4.17 to 4.19 ?
> 
> If the problem is caused by our firmware, ext4 would have failed."
> 

By the way, does making an LV atop this device work? And then making
an FS on this LV?
Comment 21 Manhong Dai 2019-01-08 17:59:58 UTC
still failed.

# mkfs.xfs /dev/vg/lv  &> log
# head log
SB stripe unit sanity check failed
Metadata corruption detected at 0x5571f8b8d4a7, xfs_sb block 0x0/0x1000
libxfs_writebufr: write verifer failed on xfs_sb bno 0x0/0x1000
cache_node_purge: refcount was 1, not zero (node=0x5571f9936160)
SB stripe unit sanity check failed
Comment 22 Manhong Dai 2019-01-22 12:49:31 UTC
An update: the Broadcom engineer cannot reproduce the issue with Arch Linux kernel 4.20.3-arch1-1-ARCH and mkfs.xfs version 4.19.0. The difference is that his RAID-60 volumes are 7T and 2T, while mine is 556T.

I upgraded both the kernel and mkfs to the same versions as above, and the problem persists.

I downgraded them to linux-4.17.14.arch1-1 and xfsprogs-4.17.0-1 again, and then the XFS filesystem could be created and mounted without any glitch.
Comment 23 Eric Sandeen 2019-01-22 14:18:49 UTC
"An update is that Broadcom engineer cannot reproduce the issue with Arch Linux kernel 4.20.3-arch1-1-ARCH and mkfs.xfs version 4.19.0."

Which issue is that?  Failed mkfs, or nonsensical geometry?

BTW, I hope we made it clear that you can (probably?) move forward by manually specifying sunit & swidth to override the junk values from the hardware.
Comment 24 Manhong Dai 2019-01-22 14:35:37 UTC
Thanks a lot for reminding me about su & sw; I just did this and mkfs.xfs worked:

# /opt/MegaRAID/storcli/storcli /c0 /v0 show all | grep  "Strip\|Drives\|RAID"
0/0   RAID60 Optl  RW     Yes     RAWBD -   ON  556.591 TB      
Strip Size = 1.0 MB
Number of Drives Per Span = 19
# mkfs.xfs -d su=1m,sw=17 /dev/sda 
# mount /dev/sda /treehouse/watsons_lab/s1/

But I still need help from you guys. What about my other storage servers that have huge XFS filesystems created under the older kernel 4.17.* and xfsprogs 4.17? I cannot mount them under the latest Arch Linux, and those filesystems are quite full.
Comment 25 Eric Sandeen 2019-01-22 15:01:27 UTC
Will it mount with "mount -o noalign" or "mount -o sunit=$VALUE1,swidth=$VALUE2" ?
Comment 26 Manhong Dai 2019-01-22 15:36:04 UTC
Neither mount -o noalign nor mount -o sunit=,swidth= works. Here are the commands.

########old kernel
[root@watsons-s1 ~]# uname -a
Linux watsons-s1.mbni.org 4.17.14-arch1-1-ARCH #1 SMP PREEMPT Thu Aug 9 11:56:50 UTC 2018 x86_64 GNU/Linux
[root@watsons-s1 ~]# mkfs.xfs -V
mkfs.xfs version 4.17.0
[root@watsons-s1 ~]# mkfs.xfs -f /dev/sda
meta-data=/dev/sda               isize=512    agcount=557, agsize=268434944 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=0
data     =                       bsize=4096   blocks=149409103872, imaxpct=1
         =                       sunit=256    swidth=64 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=521728, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0


############new kernel
[root@watsons-s1 ~]# uname -a
Linux watsons-s1.mbni.org 4.20.3-arch1-1-ARCH #1 SMP PREEMPT Wed Jan 16 22:38:58 UTC 2019 x86_64 GNU/Linux
[root@watsons-s1 ~]# mkfs.xfs -V
mkfs.xfs version 4.19.0
[root@watsons-s1 ~]# mount -t xfs -o noalign /dev/sda /treehouse/watsons_lab/s1/
mount: /treehouse/watsons_lab/s1: mount(2) system call failed: Structure needs cleaning.
[root@watsons-s1 ~]# dmesg
[  289.750470] XFS (sda): SB stripe unit sanity check failed
[  289.750564] XFS (sda): Metadata corruption detected at xfs_sb_read_verify+0x106/0x180 [xfs], xfs_sb block 0xffffffffffffffff
[  289.752487] XFS (sda): Unmount and run xfs_repair
[  289.753980] XFS (sda): First 128 bytes of corrupted metadata buffer:
[  289.755393] 00000000: 58 46 53 42 00 00 10 00 00 00 00 22 c9 7a 00 00  XFSB.......".z..
[  289.756787] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[  289.758168] 00000020: 85 3d da cd 25 7c 48 e3 a7 56 b2 4c 8c d2 d5 35  .=..%|H..V.L...5
[  289.759551] 00000030: 00 00 00 11 60 00 00 08 00 00 00 00 00 00 08 00  ....`...........
[  289.760939] 00000040: 00 00 00 00 00 00 08 01 00 00 00 00 00 00 08 02  ................
[  289.762308] 00000050: 00 00 00 01 0f ff fe 00 00 00 02 2d 00 00 00 00  ...........-....
[  289.763670] 00000060: 00 07 f6 00 bd a5 10 00 02 00 00 08 00 00 00 00  ................
[  289.765016] 00000070: 00 00 00 00 00 00 00 00 0c 0c 09 03 1c 00 00 01  ................
[  289.766432] XFS (sda): SB validate failed with error -117.
[root@watsons-s1 ~]# mount -t xfs -o sunit=256,swidth=64 /dev/sda /treehouse/watsons_lab/s1/
mount: /treehouse/watsons_lab/s1: wrong fs type, bad option, bad superblock on /dev/sda, missing codepage or helper program, or other error.
[root@watsons-s1 ~]# dmesg
[  316.159473] XFS (sda): stripe width (64) must be a multiple of the stripe unit (256)
Comment 27 Eric Sandeen 2019-01-22 15:55:04 UTC
It may still fail, but you've specified your sunit and swidth incorrectly (backwards):

[  316.159473] XFS (sda): stripe width (64) must be a multiple of the stripe unit (256)

From the manpage:

       sunit=value and swidth=value
              Used  to  specify  the  stripe  unit  and width for a RAID
              device or a stripe volume.  "value" must be  specified  in
              512-byte  block  units. These options are only relevant to
              filesystems that were created with non-zero data alignment
              parameters.

              The sunit and swidth parameters specified must be compati‐
              ble with the existing  filesystem  alignment  characteris‐
              tics.   In  general,  that means the only valid changes to
              sunit are increasing it by a  power-of-2  multiple.  Valid
              swidth  values  are  any integer multiple of a valid sunit
              value.
Comment 28 Manhong Dai 2019-01-22 17:38:56 UTC
The "sunit=256,swidth=64" comes from the output of mkfs.xfs in old kernel.

I actually tried 'sunit=256,swidth=256', the dmesg error is exactly the same as 'noalign'
Comment 29 Eric Sandeen 2019-01-22 17:48:01 UTC
(In reply to daimh from comment #28)
> The "sunit=256,swidth=64" comes from the output of mkfs.xfs in old kernel.

Yes, that's the root-cause bug causing all this pain.

> I actually tried 'sunit=256,swidth=256', the dmesg error is exactly the same
> as 'noalign'

Ok, we need to sort out a way to efficiently rewrite or remove the geometry from those old filesystems, then.  It's probably going to involve some xfs_db surgery.
Comment 30 Eric Sandeen 2019-01-22 17:48:43 UTC
Or, actually - I'm not sure if this is an option for you, but mounting with new and /correct/ sunit/swidth values on the older kernel should rewrite them on disk, and make them pass muster with the new kernel.
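
Spelled out, that workaround would look something like the sketch below, run on the old (non-validating) kernel. Note that the mount options take 512-byte units per the manpage in comment 27, so the 1 MiB stripe unit from comment 24 is 2048, and 17 data disks would give swidth = 2048 * 17 = 34816; the exact values depend on your array:

# Mounting once with corrected geometry should rewrite sb_unit/sb_width
# on disk (per comment 30), after which the filesystem should pass the
# new kernel's checks.
mount -o sunit=2048,swidth=34816 /dev/sda /treehouse/watsons_lab/s1/
umount /treehouse/watsons_lab/s1/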
Comment 31 Manhong Dai 2019-01-22 18:28:15 UTC
Yes, this trick fixed the problem. Now the filesystem created under old kernel can be mounted under the latest ArchLinux. Thanks a million for your help!
Comment 32 Dave Chinner 2019-01-22 21:52:59 UTC
On Tue, Jan 22, 2019 at 12:49:31PM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=202127
> 
> --- Comment #22 from daimh@umich.edu ---
> An update is that Broadcom engineer cannot reproduce the issue with Arch
> Linux
> kernel 4.20.3-arch1-1-ARCH and mkfs.xfs version 4.19.0. The difference is his
> raid-60 is 7T and 2T, while mine is 556T. 

IOWs, they didn't actually test your configuration, and so they
didn't see the problem you are seeing. Which implies they've got a
problem in their firmware where something overflows at sizes larger
than 7TB. If that were my raid card, I'd be tearing strips off the
support engineer's manager by now...

Cheers,

Dave.
Comment 33 Eric Sandeen 2019-01-22 21:56:43 UTC
Agreed.  I don't think we can possibly state clearly enough that the firmware is buggy.  We can help you work around it, and add some things to the utilities to catch it, but at the end of the day, your firmware /is/ buggy and your vendor doesn't seem to understand that fact.  Good luck.  :)
Comment 34 Chris Murphy 2019-01-22 22:59:01 UTC
(In reply to daimh from comment #22)
> An update is that Broadcom engineer cannot reproduce the issue with Arch
> Linux kernel 4.20.3-arch1-1-ARCH and mkfs.xfs version 4.19.0. The difference
> is his raid-60 is 7T and 2T, while mine is 556T.

That's unacceptable customer service. It's lazy, amateurish, and insulting. You've done their homework for them by tracking down this bug; they need to put more effort into providing an actual fix. I'd file a warranty claim regardless of whether it's currently under warranty or not; it's been defective from the outset.

I tried sending them a nastygram through corporate feedback, but that failed. Guess I'll send it by Twitter.
Comment 35 Manhong Dai 2019-01-24 13:02:33 UTC
Another update: the Broadcom engineer emailed me last night that they are investigating.
Comment 36 Pingmin Fenlly Liu 2019-03-25 08:46:42 UTC
Reading this discussion, I've benefited greatly from you guys!

I'm just a newbie. Thanks!
Comment 37 Manhong Dai 2020-09-24 18:19:25 UTC
Today another new machine had the same problem. I fixed it by upgrading the firmware.


####Here is the info before the update
# /opt/MegaRAID/storcli/storcli  /c0 show all |grep ^Firmware
Firmware Package Build = 24.19.0-0049
Firmware Version = 4.720.00-8220
# blockdev --getiomin   --getioopt /dev/sdc
1048576
262144
# mkfs -t xfs /dev/sdc 
meta-data=/dev/sdc               isize=512    agcount=262, agsize=268434944 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1
data     =                       bsize=4096   blocks=70314098688, imaxpct=1
         =                       sunit=256    swidth=64 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=521728, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
SB stripe unit sanity check failed
Metadata corruption detected at 0x56324656db89, xfs_sb block 0x0/0x1000
libxfs_bwrite: write verifier failed on xfs_sb bno 0x0/0x1000
mkfs.xfs: Releasing dirty buffer to free list!
found dirty buffer (bulk) on free list!
SB stripe unit sanity check failed
Metadata corruption detected at 0x56324656db89, xfs_sb block 0x0/0x1000
libxfs_bwrite: write verifier failed on xfs_sb bno 0x0/0x1000
mkfs.xfs: writing AG headers failed, err=117


#####Then I upgraded with the command below and rebooted it
# /opt/MegaRAID/storcli/storcli  /c0 download file=mr3316fw.rom 

#####Here is the info after the update
# /opt/MegaRAID/storcli/storcli  /c0 show all |grep -i firmware
Firmware Package Build = 24.22.0-0071
Firmware Version = 4.740.00-8452
# blockdev --getiomin   --getioopt /dev/sdc
262144
262144
#
Comment 38 Eric Sandeen 2020-09-24 18:52:49 UTC
Great, thanks for the update on the resolution.
Comment 39 Eric Sandeen 2020-09-24 18:57:40 UTC
FWIW, these were in the changelogs :)

SCGCQ02027889 - Cannot create or mount xfs filesystem using xfsprogs 4.19.x kernel 4.20

so presumably it was an intentional fix.  :)