Bug 65101

Summary: BFQ iosched bug during high disk I/Os
Product: IO/Storage Reporter: Tomas Thiemel (thiemel)
Component: Block LayerAssignee: Jens Axboe (axboe)
Status: RESOLVED OBSOLETE    
Severity: low CC: szg00000
Priority: P1    
Hardware: x86-64   
OS: Linux   
Kernel Version: 3.10.7-gentoo-r1 Subsystem:
Regression: No Bisected commit-id:
Attachments: kernel's config

Description Tomas Thiemel 2013-11-17 19:56:23 UTC
Created attachment 114981 [details]
kernel's config

# I was doing some BTRFS's disk rebalancing (see "*" at the bottom)
# and I started to upgrade Gentoo packages (extraction of linux kernel
# src to disk). Then BUG:

[75493.207676] ------------[ cut here ]------------
[75493.212427] kernel BUG at block/elevator.c:322!
[75493.217083] invalid opcode: 0000 [#1] SMP
[75493.221320] Modules linked in: tun bridge stp llc f71882fg usbhid dummy coretemp kvm_intel kvm crc32_pclmul crc32c_intel acpi_cpufreq ghash_clmulni_intel aesni_intel r8169 aes_x86_64 lrw mperf ehci_pci xhci_hcd firewire_ohci ehci_hcd]
[75493.261486] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.10.7-gentoo-r1-xeon-prvni #1
[75493.269442] Hardware name: MSI MS-7759/Z77MA-G45 (MS-7759), BIOS V1.9 03/01/2013
[75493.277042] task: ffffffff81810440 ti: ffffffff81800000 task.ti: ffffffff81800000
[75493.284729] RIP: 0010:[<ffffffff812e3042>]  [<ffffffff812e3042>] elv_rb_del+0x42/0x50
[75493.292795] RSP: 0018:ffff88081f203c38  EFLAGS: 00010046
[75493.298251] RAX: 0000000000000010 RBX: ffff8802f8a35318 RCX: 000000000000000c
[75493.305583] RDX: 0000000000000098 RSI: ffff8802f8a35318 RDI: ffff8801f0fbdcf8
[75493.312913] RBP: ffff8802f8a35398 R08: 0000000000000000 R09: ffffffff8181c480
[75493.320246] R10: 0000000000000003 R11: ffff8802bcc6a000 R12: ffff8801f0fbdcc0
[75493.327578] R13: ffff8807fb76ac00 R14: ffff8807faae1048 R15: ffff8807faae1000
[75493.334909] FS:  0000000000000000(0000) GS:ffff88081f200000(0000) knlGS:0000000000000000
[75493.343222] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[75493.349124] CR2: 00007fa6457e5670 CR3: 000000000180b000 CR4: 00000000001427e0
[75493.356454] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[75493.363784] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[75493.371117] Stack:
[75493.373181]  ffff880195e55000 ffff8802f8a35318 ffff8801f0fbdcc0 ffffffff81305c46
[75493.380837]  0000000000000048 ffff8807fa8f1b50 ffff8807fb7fb098 ffff8801f0fbdcc0
[75493.388494]  ffff8802f8a35318 ffff8807faae8f90 ffff8807fb76ac00 ffff8807faae1048
[75493.396149] Call Trace:
[75493.398659]  <IRQ>
[75493.400641]  [<ffffffff81305c46>] ? bfq_remove_request+0x96/0x200
[75493.407110]  [<ffffffff81305de5>] ? bfq_dispatch_insert+0x35/0x70
[75493.413374]  [<ffffffff81306294>] ? bfq_dispatch_requests+0x474/0xab0
[75493.419992]  [<ffffffff813f266f>] ? ata_scsi_translate+0x9f/0x170
[75493.426253]  [<ffffffff812e96b0>] ? blk_peek_request+0x160/0x230
[75493.432424]  [<ffffffff813dc434>] ? scsi_request_fn+0x44/0x540
[75493.438419]  [<ffffffff812e42da>] ? __blk_run_queue+0x2a/0x40
[75493.444324]  [<ffffffff812e6748>] ? blk_run_queue+0x28/0x50
[75493.450050]  [<ffffffff813dafb4>] ? scsi_run_queue+0x114/0x270
[75493.456046]  [<ffffffff813d4cf0>] ? scsi_pool_free_command.isra.11+0x40/0x60
[75493.463290]  [<ffffffff813dd25b>] ? scsi_next_command+0x3b/0x60
[75493.469375]  [<ffffffff8120c/0x6d0
[75493.475726]  [<ffffffff812ee3fb>] ? blk_done_softirq+0x6b/0x80
[75493.481718]  [<ffffffff81041dce>] ? __do_softirq+0xce/0x200
[75493.487442]  [<ffffffff8104201e>] ? irq_exit+0x7e/0xa0
[75493.492722]  [<ffffffff8100435b>] ? do_IRQ+0x5b/0xd0
[75493.504190]  [<ffffffff81554aea>] ? common_interrupt+0x6a/0x6a
[75493.516569]  <EOI>
[75493.518550]  [<ffffffff81443fc6>] ? cpuidle_enter_state+0x56/0xe0
[75493.537770]  [<ffffffff81443fc2>] ? cpuidle_enter_state+0x52/0xe0
[75493.550258]  [<ffffffff814440fe>] ? cpuidle_idle_call+0xae/0x1f0
[75493.562571]  [<ffffffff8100b429>] ? arch_cpu_idle+0x9/0x30
[75493.574341]  [<ffffffff810774fd>] ? cpu_startup_entry+0x7d/0x200
[75493.586471]  [<ffffffff818a5d91>] ? start_kernel+0x37b/0x386
[75493.598037]  [<ffffffff818a582d>] ? repair_env_string+0x5b/0x5b
[75493.609652] Code: 00 48 39 ae 80 00 00 00 74 21 48 89 fe 48 89 ef e8 f4 ce 02 00 48 89 ab 80 00 00 00 48 8b 5c 24 08 48 8b 6c 24 10 48 83 c4 18 c3 <0f> 0b 66 66 66 2e 0f 1f 84 00 00 00 00 00 48 83 ec 18 48 89 1c
[75493.640895] RIP  [<ffffffff812e3042>] elv_rb_del+0x42/0x50
[75493.652051]  RSP <ffff88081f203c38>
[75493.676630] ---[ end trace c32c7a49c93bdf90 ]---

# OK, the box has rebooted (reboot on panic)
# So I tried to continue with packages upgrade (kernel src extraction to HDD) 
# & executed btrfs rebalance again and... another BUG!

[ 1746.714098] ------------[ cut here ]------------
[ 1746.718835] kernel BUG at block/bfq-iosched.c:1869!
[ 1746.723830] invalid opcode: 0000 [#1] SMP
[ 1746.728047] Modules linked in: tun bridge stp llc f71882fg dummy usbhid coretemp kvm_intel kvm crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel aes_x86_64 lrwm gf128mul glue_helper ehci_pci ablk_helper xhci_hcd firewire_ohci]
[ 1746.768126] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.10.7-gentoo-r1-xeon-prvni #1
[ 1746.776057] Hardware name: MSI MS-7759/Z77MA-G45 (MS-7759), BIOS V1.9 03/01/2013
[ 1746.783632] task: ffffffff81810440 ti: ffffffff81800000 task.ti: ffffffff81800000
[ 1746.791294] RIP: 0010:[<ffffffff813065c9>]  [<ffffffff813065c9>] bfq_dispatch_requests+0x7a9/0xab0
[ 1746.800490] RSP: 0018:ffff88081f203cd8  EFLAGS: 00010046
[ 1746.805928] RAX: 00000000000000dd RBX: 0000000000000004 RCX: 0000000000000040
[ 1746.813233] RDX: 00000000000000dd RSI: 0000000000000000 RDI: 0000000000000001
[ 1746.820541] RBP: ffff8807f6c91800 R08: 0000000000000000 R09: ffffffff8181c480
[ 1746.827851] R10: 0000000000000003 R11: ffff8806b4e63600 R12: ffff8807f6cd0730
[ 1746.835158] R13: ffff8806b4e63600 R14: ffff8807f6cc0048 R15: ffff8807f6cc0000
[ 1746.842464] FS:  0000000000000000(0000) GS:ffff88081f200000(0000) knlGS:0000000000000000
[ 1746.850749] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1746.856630] CR2: 00007f5969f966f0 CR3: 0000000754eed000 CR4: 00000000001427e0
[ 1746.863938] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1746.871244] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 1746.878548] Stack:
[ 1746.880605]  ffff8807f6cc0000 ffffffff813f266f ffff8806b4e63600 ffff8807f66b8000
[ 1746.888237]  0000000000000296 ffff8807f6cc0000 0000000000000296 ffff8807f6cb07c8
[ 1746.895863]  ffff8807f6cb07c8 0000000000000000 ffff8806b4e63600 ffff8807f6cc0048dline
[ 1746.903985] Call Trace:
[ 1746.906487]  <IRQ>
[ 1746.908460]  [<ffffffff813f266f>] ? ata_scsi_translate+0x9f/0x170
[ 1746.914935]  [<ffffffff812e96b0>] ? blk_peek_request+0x160/0x230
[ 1746.921085]  [<ffffffff813dc434>] ? scsi_request_fn+0x44/0x540
[ 1746.927059]  [<ffffffff81055a78>] ? insert_work+0x58/0xd0
[ 1746.932589]  [<ffffffff812e42da>] ? __blk_run_queue+0x2a/0x40
[ 1746.938474]  [<ffffffff812e6748>] ? blk_run_queue+0x28/0x50
[ 1746.944178]  [<ffffffff813dafb4>] ? scsi_run_queue+0x114/0x270
[ 1746.950151]  [<ffffffff8111bfcc>] ? kmem_cache_free+0xec/0x140
[ 1746.956125]  [<ffffffff813dd25b>] ? scsi_next_command+0x3b/0x60
[ 1746.962187]  [<ffffffff813dd4dc>] ? scsi_io_completion+0x20c/0x6d0
[ 1746.968516]  [<ffffffff812ee3fb>] ? blk_done_softirq+0x6b/0x80
[ 1746.974487]  [<ffffffff81041dce>] ? __do_softirq+0xce/0x200
[ 1746.980195]  [<ffffffff8104201e>] ? irq_exit+0x7e/0xa0
[ 1746.985455]  [<ffffffff8100435b>] ? do_IRQ+0x5b/0xd0
[ 1746.990541]  [<ffffffff81554aea>] ? common_interrupt+0x6a/0x6a
[ 1746.996511]  <EOI>
[ 1746.998482]  [<ffffffff81443fc6>] ? cpuidle_enter_state+0x56/0xe0
[ 1747.016262]  [<ffffffff81443fc2>] ? cpuidle_enter_state+0x52/0xe0
[ 1747.028087]  [<ffffffff814440fe>] ? cpuidle_idle_call+0xae/0x1f0
[ 1747.039728]  [<ffffffff8100b429>] ? arch_cpu_idle+0x9/0x30
[ 1747.050794]  [<ffffffff810774fd>] ? cpu_startup_entry+0x7d/0x200
[ 1747.062331]  [<ffffffff818a5d91>] ? start_kernel+0x37b/0x386
[ 1747.073503]  [<ffffffff818a582d>] ? repair_env_string+0x5b/0x5b
[ 1747.084786] Code: 81 31 c0 e8 6a cf db ff e9 a1 f8 ff ff 0f 0b 48 89 c7 48 c7 c6 17 67 73 81 31 c0 e8 52 cf db ff 48 8b 7d 00 e9 71 fc ff ff 0f 0b <0f> 0b 4c 89 e7 e8 bd be ff ff b9 03 00 00 00 31 d2 41 89 c1 49
[ 1747.115321] RIP  [<ffffffff813065c9>] bfq_dispatch_requests+0x7a9/0xab0
[ 1747.127143]  RSP <ffff88081f203cd8>
[ 1747.149492] ---[ end trace 610e73f0526a88ff ]---



*) So here is "exact" the scenario (as I had previously some problems with BTRFS):

(rs232) ftp1 # btrfs fi df /
Data, RAID1: total=4.00GB, used=3.00GB
System, RAID1: total=32.00MB, used=4.00KB
Metadata, RAID1: total=1.00GB, used=609.41MB

(rs232) ftp1 # btrfs fi show  --all-devices  BTRFSroot
Label: 'BTRFSroot'  uuid: cfba9ed4-18fa-4d0e-9afb-5932af7064ad
        Total devices 2 FS bytes used 3.59GB
        devid    6 size 92.50GB used 5.03GB path /dev/sdg3
        devid    7 size 92.50GB used 5.03GB path /dev/sdh3
Btrfs v0.20-rc1-358-g194aa4a

(rs232) ftp1 # btrfs fi df /DATA
Data, RAID1: total=4.23TB, used=4.23TB
Data: total=1.38TB, used=1.21TB
System: total=36.00MB, used=800.00KB
Metadata: total=10.01GB, used=8.84GB

(rs232) ftp1 # btrfs fi show  --all-devices  DATA
Label: 'DATA'  uuid: 49e7271f-c0c4-4708-9dfa-4e385257b544
        Total devices 4 FS bytes used 5.45TB
        devid    2 size 2.60TB used 2.46TB path /dev/sdb4
        devid    1 size 2.60TB used 2.46TB path /dev/sdc4
        devid    4 size 2.60TB used 2.46TB path /dev/sdg4
        devid    3 size 2.60TB used 2.46TB path /dev/sdh4
Btrfs v0.20-rc1-358-g194aa4a

# I EXECUTED:
(rs232) ftp1 # btrfs balance start -dconvert=single -mconvert=single -sconvert=single -f /DATA

# and during rebalance and I did
(rs232) ftp1 # emerge -uavDN @world

# which downloaded, extracted & autocompiled some binaries from source codes including Linux kernel sources to /usr/src/; but the machine stopped to respond - "luckily" I saw the error in the console. OK, so I trusted the BTRFS and continued - executed BTRFS re-balance & emerge again. Then second bug appeared.

PS: I hope it's just BFQ issue, so I'm going to recompile kernel with CFQ, and I hope I won't see any other kernel bug today nor tomorrow :)
Comment 1 Tomas Thiemel 2013-12-02 05:46:04 UTC
Confirmed, it's BFQ issue. After kernel recompilation with CFQ everything works smoothly.