This just started happening with mainline, but I bisected it back to the following commit:

commit d3805611130af9b911e908af9f67a3f64f4f0914
Author: Keith Busch <keith.busch@intel.com>
Date:   Tue Dec 22 15:48:44 2015 -0700

    block: Split bios on chunk boundaries

    For h/w that advertise their block storage's underlying chunk size, it's
    a big performance win to not submit commands that cross them. This patch
    uses that criteria if it is provided. If it is not provided, this patch
    uses the max sectors as before.

    Signed-off-by: Keith Busch <keith.busch@intel.com>
    Signed-off-by: Jens Axboe <axboe@fb.com>

[ 938.125561] kernel BUG at block/bio.c:1787!
[ 938.127100] invalid opcode: 0000 [#1] SMP
[ 938.128622] Modules linked in: zram
[ 938.130128] CPU: 1 PID: 3424 Comm: rsync Tainted: G U 4.4.0-rc7-GTW+ #1
[ 938.131647] Hardware name: ASUS All Series/SABERTOOTH Z97 MARK 1, BIOS 2702 10/27/2015
[ 938.133170] task: ffff8807f1126600 ti: ffff88080d2f0000 task.ti: ffff88080d2f0000
[ 938.134692] RIP: 0010:[<ffffffff813dfa75>] [<ffffffff813dfa75>] bio_split+0x65/0x70
[ 938.136227] RSP: 0018:ffff88080d2f3a18 EFLAGS: 00010246
[ 938.137753] RAX: 0000000000000000 RBX: 0000000000001000 RCX: ffff880819d18180
[ 938.139281] RDX: 0000000002400000 RSI: 0000000000000000 RDI: ffff88078bf7ccc0
[ 938.140787] RBP: ffff88080d2f3aa0 R08: ffff88081b740800 R09: 0000000000000004
[ 938.142281] R10: ffff88078bf7ccc0 R11: 0000000000000000 R12: 000000000002b000
[ 938.143761] R13: 0000000000000000 R14: 0000000000000015 R15: ffff880814b2d400
[ 938.145239] FS: 00007f92b78ca700(0000) GS:ffff88083fa40000(0000) knlGS:0000000000000000
[ 938.146724] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 938.148209] CR2: 000000000243c288 CR3: 00000007e26e7000 CR4: 00000000001406e0
[ 938.149696] Stack:
[ 938.151173] ffffffff813ed34a ffff880819d18180 ffffe8ffffc41f00 ffff88080d2f3a58
[ 938.152682] ffff88078bf7ccc0 ffff88080d2f3ac8 00000000000dbafc 0000000000000000
[ 938.154197] 0000000000000000 ffff88078bf7ccc0 ffff880818fbccc0 000000000001013d
[ 938.155718] Call Trace:
[ 938.157227] [<ffffffff813ed34a>] ? blk_queue_split+0x22a/0x490
[ 938.158761] [<ffffffff813f29fc>] blk_mq_make_request+0x5c/0x390
[ 938.160294] [<ffffffff8128bdad>] ? do_mpage_readpage+0x42d/0x6e0
[ 938.161822] [<ffffffff813e71f3>] generic_make_request+0xd3/0x180
[ 938.163345] [<ffffffff813e7307>] submit_bio+0x67/0x140
[ 938.164872] [<ffffffff8128c19a>] mpage_readpages+0x13a/0x160
[ 938.166402] [<ffffffff8132b610>] ? fat_detach+0xd0/0xd0
[ 938.167934] [<ffffffff8132b610>] ? fat_detach+0xd0/0xd0
[ 938.169456] [<ffffffff8123876c>] ? alloc_pages_current+0x8c/0x110
[ 938.170984] [<ffffffff8132b84d>] fat_readpages+0x1d/0x20
[ 938.172497] [<ffffffff811fb8c8>] __do_page_cache_readahead+0x168/0x200
[ 938.174001] [<ffffffff811fba30>] ondemand_readahead+0xd0/0x250
[ 938.175503] [<ffffffff811fbd9e>] page_cache_sync_readahead+0x2e/0x50
[ 938.177015] [<ffffffff811f043f>] generic_file_read_iter+0x46f/0x570
[ 938.178533] [<ffffffff811fd4b7>] ? lru_cache_add_active_or_unevictable+0x27/0x80
[ 938.180060] [<ffffffff8121b644>] ? handle_mm_fault+0xe04/0x1440
[ 938.181585] [<ffffffff81250dc7>] __vfs_read+0xa7/0xd0
[ 938.183101] [<ffffffff81251566>] vfs_read+0x86/0x130
[ 938.184612] [<ffffffff81252216>] SyS_read+0x46/0xb0
[ 938.186115] [<ffffffff819de3b6>] entry_SYSCALL_64_fastpath+0x16/0x75
[ 938.187618] Code: 4d 85 ed 74 12 44 89 e6 48 89 df c1 e6 09 41 89 75 28 e8 bf f1 ff ff 5b 4c 89 e8 41 5c 41 5d 5d c3 e8 b0 fc ff ff 49 89 c5 eb d5 <0f> 0b 0f 0b 0f 1f 80 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41
[ 938.189260] RIP [<ffffffff813dfa75>] bio_split+0x65/0x70
[ 938.190843] RSP <ffff88080d2f3a18>
[ 938.211401] ---[ end trace 1f58ea74114814ec ]---
[ 938.212799] ------------[ cut here ]------------
[ 938.212801] WARNING: CPU: 1 PID: 3424 at kernel/exit.c:661 do_exit+0x50/0xac0()
[ 938.212802] Modules linked in: zram
[ 938.212803] CPU: 1 PID: 3424 Comm: rsync Tainted: G UD 4.4.0-rc7-GTW+ #1
[ 938.212804] Hardware name: ASUS All Series/SABERTOOTH Z97 MARK 1, BIOS 2702 10/27/2015
[ 938.212805] ffffffff81c2030d ffff88080d2f3738 ffffffff8140eaa4 0000000000000000
[ 938.212806] ffff88080d2f3770 ffffffff8111c662 ffff8807f1126600 000000000000000b
[ 938.212807] ffff88080d2f3968 0000000000000000 0000000000000000 ffff88080d2f3780
[ 938.212808] Call Trace:
[ 938.212810] [<ffffffff8140eaa4>] dump_stack+0x44/0x60
[ 938.212812] [<ffffffff8111c662>] warn_slowpath_common+0x82/0xc0
[ 938.212813] [<ffffffff8111c75a>] warn_slowpath_null+0x1a/0x20
[ 938.212814] [<ffffffff8111de00>] do_exit+0x50/0xac0
[ 938.212816] [<ffffffff81053ac1>] oops_end+0xa1/0xd0
[ 938.212817] [<ffffffff81053c1b>] die+0x4b/0x70
[ 938.212818] [<ffffffff81050f91>] do_trap+0xb1/0x140
[ 938.212819] [<ffffffff81051097>] do_error_trap+0x77/0xe0
[ 938.212820] [<ffffffff813dfa75>] ? bio_split+0x65/0x70
[ 938.212823] [<ffffffff812824ea>] ? __find_get_block+0xaa/0x100
[ 938.212824] [<ffffffff811f1265>] ? mempool_alloc_slab+0x15/0x20
[ 938.212825] [<ffffffff811f1265>] ? mempool_alloc_slab+0x15/0x20
[ 938.212826] [<ffffffff81051370>] do_invalid_op+0x20/0x30
[ 938.212827] [<ffffffff819dfdae>] invalid_op+0x1e/0x30
[ 938.212828] [<ffffffff813dfa75>] ? bio_split+0x65/0x70
[ 938.212830] [<ffffffff813ef900>] ? __blk_mq_alloc_request+0xe0/0x1e0
[ 938.212831] [<ffffffff813ed34a>] ? blk_queue_split+0x22a/0x490
[ 938.212832] [<ffffffff813f29fc>] blk_mq_make_request+0x5c/0x390
[ 938.212833] [<ffffffff8128bdad>] ? do_mpage_readpage+0x42d/0x6e0
[ 938.212835] [<ffffffff813e71f3>] generic_make_request+0xd3/0x180
[ 938.212836] [<ffffffff813e7307>] submit_bio+0x67/0x140
[ 938.212837] [<ffffffff8128c19a>] mpage_readpages+0x13a/0x160
[ 938.212838] [<ffffffff8132b610>] ? fat_detach+0xd0/0xd0
[ 938.212839] [<ffffffff8132b610>] ? fat_detach+0xd0/0xd0
[ 938.212841] [<ffffffff8123876c>] ? alloc_pages_current+0x8c/0x110
[ 938.212842] [<ffffffff8132b84d>] fat_readpages+0x1d/0x20
[ 938.212844] [<ffffffff811fb8c8>] __do_page_cache_readahead+0x168/0x200
[ 938.212845] [<ffffffff811fba30>] ondemand_readahead+0xd0/0x250
[ 938.212846] [<ffffffff811fbd9e>] page_cache_sync_readahead+0x2e/0x50
[ 938.212847] [<ffffffff811f043f>] generic_file_read_iter+0x46f/0x570
[ 938.212848] [<ffffffff811fd4b7>] ? lru_cache_add_active_or_unevictable+0x27/0x80
[ 938.212850] [<ffffffff8121b644>] ? handle_mm_fault+0xe04/0x1440
[ 938.212851] [<ffffffff81250dc7>] __vfs_read+0xa7/0xd0
[ 938.212852] [<ffffffff81251566>] vfs_read+0x86/0x130
[ 938.212853] [<ffffffff81252216>] SyS_read+0x46/0xb0
[ 938.212854] [<ffffffff819de3b6>] entry_SYSCALL_64_fastpath+0x16/0x75

The block device in question is an Intel 750 NVME SSD:

02:00.0 Non-Volatile memory controller: Intel Corporation PCIe Data Center SSD (rev 01) (prog-if 02 [NVM Express])
        Subsystem: Intel Corporation Device 370d
        Flags: bus master, fast devsel, latency 0, NUMA node 0
        Memory at dfe10000 (64-bit, non-prefetchable) [size=16K]
        Expansion ROM at dfe00000 [disabled] [size=64K]
        Capabilities: [40] Power Management version 3
        Capabilities: [50] MSI-X: Enable+ Count=32 Masked-
        Capabilities: [60] Express Endpoint, MSI 00
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [150] Virtual Channel
        Capabilities: [180] Power Budgeting <?>
        Capabilities: [190] Alternative Routing-ID Interpretation (ARI)
        Capabilities: [270] Device Serial Number 55-cd-2e-41-4c-88-d1-e8
        Capabilities: [2a0] #19
        Kernel driver in use: nvme

Reverting that single commit seems to fix the problem with mainline. I have what seems to be a consistent way to reproduce this (building the kernel, aptly enough).
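
For readers unfamiliar with what the bisected commit is doing, here is a rough userspace model of the idea in its commit message: if the device advertises a chunk size, an I/O is limited to the sectors remaining before the next chunk boundary; otherwise the plain max-sectors limit applies. This is only a sketch, not the kernel code. The function names are hypothetical, and the power-of-two chunk size is an assumption made so the boundary can be found with a mask.

```python
def sectors_to_chunk_boundary(start_sector: int, chunk_sectors: int) -> int:
    """Sectors that fit between start_sector and the next chunk boundary.

    Assumes chunk_sectors is a power of two (an assumption of this sketch).
    """
    if chunk_sectors <= 0 or chunk_sectors & (chunk_sectors - 1):
        raise ValueError("chunk_sectors must be a positive power of two")
    # Offset inside the current chunk, subtracted from the chunk size.
    return chunk_sectors - (start_sector & (chunk_sectors - 1))


def max_io_sectors(start_sector: int, max_sectors: int, chunk_sectors: int = 0) -> int:
    """Limit for one I/O: chunk-based if a chunk size is advertised,
    otherwise the plain max-sectors limit (as the commit message describes)."""
    if chunk_sectors == 0:
        return max_sectors
    return sectors_to_chunk_boundary(start_sector, chunk_sectors)


if __name__ == "__main__":
    # An I/O starting 100 sectors into a 256-sector chunk.
    print(max_io_sectors(100, 2560, 256))
    # The same I/O on a device that advertises no chunk size.
    print(max_io_sectors(100, 2560))
```

Note that an I/O starting exactly on a boundary gets a full chunk, not zero sectors.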
Thanks, we'll take a look.
Created attachment 198601
Segment split patch

Can you try this patch?
Thanks. It no longer seems to reproduce with that patch applied.
Created attachment 198681
Fix for split on first bio vector

Thanks for the catch. This fails xfstests as well. I've attached an alternative fix that still splits the command. For performance with this hardware, it's preferable that such commands are split.
Created attachment 198691
Re-attaching as a patch.
Retested with patch #2. This also seems to work.
Great, thanks! I'll sync with Jens this week to see which route to go. I recommend mine for a couple of reasons. A bio can be split in the middle of a vector, so we might as well use the preferred alignment instead of requiring the driver to accept the entire vector. And I think there's an issue in Jens' patch (perhaps only in theory) if the first bio vector's length is greater than the h/w's max transfer size.
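
To illustrate the two points above, here is a small userspace model comparing a splitter that may only cut between vectors with one that may cut inside a vector. It is a sketch under stated assumptions, not either of the attached patches: vectors are modelled as plain sector counts, `limit` stands for whatever the hardware allows (chunk boundary or max transfer size), and both function names are hypothetical.

```python
from typing import List, Tuple


def split_at_vector_boundary(vec_sectors: List[int], limit: int) -> int:
    """Sectors in the front piece when only whole vectors may be taken."""
    total = 0
    for n in vec_sectors:
        if total + n > limit:
            break  # this vector does not fit, so cut before it
        total += n
    return total


def split_mid_vector(vec_sectors: List[int], limit: int) -> Tuple[int, int, int]:
    """Front piece when the cut may land inside a vector.

    Returns (front_sectors, vector_index, sectors_into_that_vector).
    If no split is needed, vector_index is len(vec_sectors).
    """
    total = 0
    for i, n in enumerate(vec_sectors):
        if total + n > limit:
            # Cut exactly at the limit, part-way through vector i.
            return limit, i, limit - total
        total += n
    return total, len(vec_sectors), 0


if __name__ == "__main__":
    # First vector alone is larger than the limit.
    print(split_at_vector_boundary([16], 8))  # whole-vector cut has nothing to take
    print(split_mid_vector([16], 8))          # mid-vector cut lands at the limit
```

With a first vector larger than the limit, the whole-vector variant yields a front piece of zero sectors, which is not a usable split. The thread does not state that this is exactly what tripped the BUG in `bio_split`, so treat that connection as a guess.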
I think there's potential for my patch to report the wrong segment count. I'll fix that up and resend to the mailing list after a successful xfstests run.
Keith, your approach is the best one, for sure. Let me know when you have the segment part tested, and I can queue up the fix.
Created attachment 198751
patch submitted to list

This one passed the xfstests run that was failing before. The previous patch passed too, but I think that was more coincidence: we still need to split on SG page gaps, which wasn't taken into account before.
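
For context on the SG page gap remark, the following sketch models one common way to detect such a gap between two adjacent segments: the previous segment must end on the device's boundary and the next one must start on it, otherwise the request has to be split there. This is an illustrative model only. The function name is hypothetical, and the 4 KiB boundary mask used in the example is an assumption, not a value taken from this thread.

```python
def has_sg_gap(prev_offset: int, prev_len: int, next_offset: int,
               boundary_mask: int) -> bool:
    """True if two adjacent segments leave a gap the device cannot describe.

    boundary_mask is (boundary size - 1); 0 means the device has no such
    restriction. All values are in bytes.
    """
    if boundary_mask == 0:
        return False
    prev_ends_unaligned = (prev_offset + prev_len) & boundary_mask
    next_starts_unaligned = next_offset & boundary_mask
    return bool(prev_ends_unaligned or next_starts_unaligned)


if __name__ == "__main__":
    mask = 4096 - 1  # assumed 4 KiB boundary
    print(has_sg_gap(0, 4096, 0, mask))    # full page followed by page start
    print(has_sg_gap(0, 512, 0, mask))     # previous segment ends mid-page
    print(has_sg_gap(0, 4096, 512, mask))  # next segment starts mid-page
```

A splitter that accounts for gaps would cut the bio at the first pair of vectors for which this check is true, in addition to cutting at the size limit.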
Thanks, I saw that right after writing here. Looks good to me, queued up.
Closing as fixed.