Bug 71231 - System unresponsive after a lot of LUNs have been added to the system
Status: NEW
Alias: None
Product: Other
Classification: Unclassified
Component: Other
Hardware: x86-64 Linux
Importance: P1 high
Assignee: linux-scsi@vger.kernel.org
 
Reported: 2014-02-27 17:45 UTC by Alex
Modified: 2021-06-28 07:39 UTC

Kernel Version: 3.10
Regression: Yes

Attachments
Kernel config (78.03 KB, application/octet-stream)
2014-02-27 17:45 UTC, Alex
Kernel messages (73.78 KB, text/plain)
2014-02-27 17:46 UTC, Alex

Description Alex 2014-02-27 17:45:40 UTC
Created attachment 127581
Kernel config

This kernel crash is reproducible on my system.
So far I have reproduced it with 3.10.25 and 3.10.32.
It works fine with 3.1.

The system (a Dell R910, but it also happens with other servers) has a QLogic FC-HBA (qla2xxx driver).
After connecting about 30 LUNs the system works fine, but if I increase the number of FC-LUNs to, for example, 40, the system hangs after a few seconds.

Maybe this bug has already been fixed by SUSE:
http://kernel.suse.com/cgit/kernel-source/tree/patches.fixes/mm-resched-to-avoid-rcu-stall-during-boot-large-machines.patch?h=SLE11-SP3&id=6780159bba20e9f99dd5ac8a4e18f98f9c93adf7

If that is the case, I think this patch should be applied to the current kernel lines.
Comment 1 Alex 2014-02-27 17:46:21 UTC
Created attachment 127591
Kernel messages
Comment 2 Alex 2014-02-27 17:47:11 UTC
Part of the attached kernel messages:

[  119.504736] Code: 00 48 89 f0 48 c7 06 00 00 00 00 48 89 e5 48 87 07 48 85 c0 75 09 c7 46 08 01 00 00 00 5d c3 48 89 30 8b 46 08 85 c0 75 f4 f3 90 <8b> 46 08 85 c0 74 f7 5d c3 0f 1f 44 00 00 48 8b 16 55 48 89 e5 
[  119.504740] NMI backtrace for cpu 44
[  119.504746] CPU: 44 PID: 21877 Comm: lsscsi Not tainted 3.10.32-64bit #1
[  119.504748] Hardware name: Dell Inc. PowerEdge R910/0P658H, BIOS 1.0.1 02/19/2010
[  119.504751] task: ffff881063510d60 ti: ffff88106351c000 task.ti: ffff88106351c000
[  119.504760] RIP: 0010:[<ffffffff8105e505>]  [<ffffffff8105e505>] mspin_lock+0x35/0x40
[  119.504763] RSP: 0018:ffff88106351db38  EFLAGS: 00000246
[  119.504765] RAX: 0000000000000000 RBX: ffffffff81aa9220 RCX: 0000000000000028
[  119.504768] RDX: ffff88106351dfd8 RSI: ffff88106351db60 RDI: ffffffff81aa9240
[  119.504770] RBP: ffff88106351db38 R08: 70722f3374736f68 R09: fefefefefefefeff
[  119.504772] R10: ffff88086ddc5bc0 R11: 8f8dd0cc8b8c9097 R12: ffff88106351db60
[  119.504774] R13: ffffffff81aa9240 R14: ffff881063510d60 R15: ffff88106351dca8
[  119.504777] FS:  0000000000000000(0000) GS:ffff88046fd60000(0063) knlGS:00000000f762f8d0
[  119.504780] CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
[  119.504782] CR2: 00000000f777c000 CR3: 00000010684ea000 CR4: 00000000000007e0
[  119.504785] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  119.504788] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  119.504789] Stack:
[  119.504794]  ffff88106351dba8 ffffffff81766d5d ffff88106351db58 ffff88106351dfd8
[  119.504798]  ffff88106351dba8 ffff881066ebfb90 ffff881000000000 ffffffff00000003
[  119.504803]  ffff88106351db98 ffffffff81aa9220 ffff88106351dc98 ffff88046c3a94d0
[  119.504804] Call Trace:
[  119.504812]  [<ffffffff81766d5d>] __mutex_lock_slowpath+0x5d/0x1d0
[  119.504817]  [<ffffffff817667dd>] mutex_lock+0x1d/0x40
[  119.504823]  [<ffffffff81156ba5>] sysfs_dentry_revalidate+0x35/0x100
[  119.504828]  [<ffffffff810ff389>] lookup_fast+0x299/0x2e0
[  119.504833]  [<ffffffff810ff586>] ? __inode_permission+0x46/0x70
[  119.504838]  [<ffffffff810ff79a>] link_path_walk+0x19a/0x810
[  119.504843]  [<ffffffff810ff994>] link_path_walk+0x394/0x810
[  119.504848]  [<ffffffff810b9b8f>] ? release_pages+0x18f/0x1e0
[  119.504853]  [<ffffffff81103094>] path_openat.isra.72+0x94/0x460
[  119.504859]  [<ffffffff810cbf84>] ? tlb_flush_mmu+0x54/0x90
[  119.504864]  [<ffffffff810d2c79>] ? unmap_region+0xd9/0x120
[  119.504869]  [<ffffffff8110368c>] do_filp_open+0x3c/0x90
[  119.504875]  [<ffffffff8110ee52>] ? __alloc_fd+0x42/0x100
[  119.504882]  [<ffffffff810f3a3f>] do_sys_open+0xef/0x1d0
[  119.504888]  [<ffffffff8113a036>] compat_SyS_open+0x16/0x20
[  119.504893]  [<ffffffff8176b598>] sysenter_dispatch+0x7/0x25
[  119.504938] Code: f0 48 c7 06 00 00 00 00 48 89 e5 48 87 07 48 85 c0 75 09 c7 46 08 01 00 00 00 5d c3 48 89 30 8b 46 08 85 c0 75 f4 f3 90 8b 46 08 <85> c0 74 f7 5d c3 0f 1f 44 00 00 48 8b 16 55 48 89 e5 48 85 d2 
[  119.504942] NMI backtrace for cpu 45
[  119.504946] CPU: 45 PID: 0 Comm: swapper/45 Not tainted 3.10.32-64bit #1
[  119.504949] Hardware name: Dell Inc. PowerEdge R910/0P658H, BIOS 1.0.1 02/19/2010
[  119.504952] task: ffff88046dcd6b00 ti: ffff88046dcf6000 task.ti: ffff88046dcf6000
[  119.504960] RIP: 0010:[<ffffffff8102318a>]  [<ffffffff8102318a>] mwait_idle_with_hints+0x5a/0x70
[  119.504962] RSP: 0018:ffff88046dcf7dd8  EFLAGS: 00000046
[  119.504965] RAX: 0000000000000000 RBX: ffff88106d07d46c RCX: 0000000000000001
[  119.504967] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000000
[  119.504970] RBP: ffff88046dcf7dd8 R08: ffff88046dcf7fd8 R09: 000000000000001c
[  119.504972] R10: 0000000000000166 R11: 00000000016f1e50 R12: 0000000000000001
[  119.504975] R13: ffff88106d07d400 R14: 0000000000000001 R15: ffffffffa0004830
[  119.504978] FS:  0000000000000000(0000) GS:ffff88086fd60000(0000) knlGS:0000000000000000
[  119.504981] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  119.504984] CR2: 00000000f7746000 CR3: 000000046cb57000 CR4: 00000000000007e0
[  119.504986] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  119.504989] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  119.504989] Stack:
[  119.504994]  ffff88046dcf7de8 ffffffff810231cd ffff88046dcf7df8 ffffffffa0001515
[  119.504999]  ffff88046dcf7e28 ffffffffa000164a 00000000000000f5 ffff88106c669000
[  119.505003]  ffffffffa00047c0 0000001aaf4aa38f ffff88046dcf7e88 ffffffff8164e83a
[  119.505004] Call Trace:
[  119.505010]  [<ffffffff810231cd>] acpi_processor_ffh_cstate_enter+0x2d/0x30
[  119.505021]  [<ffffffffa0001515>] acpi_idle_do_entry+0x10/0x2b [processor]
[  119.505028]  [<ffffffffa000164a>] acpi_idle_enter_c1+0x5c/0x81 [processor]
[  119.505035]  [<ffffffff8164e83a>] cpuidle_enter_state+0x4a/0xe0
[  119.505041]  [<ffffffff8164e96e>] cpuidle_idle_call+0x9e/0x160
[  119.505046]  [<ffffffff8106092d>] ? __atomic_notifier_call_chain+0xd/0x10
[  119.505052]  [<ffffffff8100af49>] arch_cpu_idle+0x9/0x30
[  119.505057]  [<ffffffff81073241>] cpu_startup_entry+0x91/0x180
[  119.505063]  [<ffffffff81b4be73>] start_secondary+0x1a0/0x1a4
Comment 3 Alex 2014-02-28 12:21:42 UTC
Kernel 3.8.13 is working fine
Kernel 3.9.11 is working fine
Kernel 3.12.13 has the bug too
Kernel 3.13.5 is working fine again.

So what has changed in 3.13? It would be nice to have it fixed for the longterm kernel versions 3.10 and/or 3.12.


With kernel 3.13 I just get these error messages, but the system does not crash:

[  457.702295] INFO: rcu_sched detected stalls on CPUs/tasks: { 0} (detected by 1, t=2102 jiffies, g=5964, c=5963, q=317961)
[  457.702296] sending NMI to all CPUs:
[  457.702299] NMI backtrace for cpu 3
[  457.702302] CPU: 3 PID: 1363 Comm: kworker/3:2 Not tainted 3.13.5-64bit #2
[  457.702303] Hardware name: Dell Inc. PowerEdge R210/0M877N, BIOS 1.1.4 11/16/2009
[  457.702307] task: ffff88003cc01b00 ti: ffff88003c076000 task.ti: ffff88003c076000
[  457.702312] RIP: 0010:[<ffffffff81073bdc>]  [<ffffffff81073bdc>] dequeue_task_fair+0x27c/0x6f0
[  457.702313] RSP: 0018:ffff88003c077d38  EFLAGS: 00000046
[  457.702314] RAX: ffff88003f591980 RBX: ffff88003f5919f0 RCX: ffff88003f591980
[  457.702314] RDX: ffffffff81c70a40 RSI: ffff88003cc01b68 RDI: ffff88003f5919f0
[  457.702315] RBP: ffff88003c077d88 R08: 0000000000000001 R09: 00000000000002e5
[  457.702315] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000001
[  457.702316] R13: 0000000000000001 R14: 0000000000000000 R15: ffff88003f5919f0
[  457.702317] FS:  0000000000000000(0000) GS:ffff88003f580000(0000) knlGS:0000000000000000
[  457.702318] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  457.702319] CR2: 00000000000000b8 CR3: 0000000026aa6000 CR4: 00000000000007e0
[  457.702319] Stack:
[  457.702321]  ffff88003c077d68 ffffffff8106f665 0000000100000000 ffff88003f591980
[  457.702322]  ffff88003f591980 ffff88003cc01b00 ffff88003f591980 0000000000000001
[  457.702323]  ffff88003f591200 ffff88003cefe200 ffff88003c077db8 ffffffff8106a73a
[  457.702324] Call Trace:
[  457.702328]  [<ffffffff8106f665>] ? sched_clock_cpu+0xc5/0x120
[  457.702330]  [<ffffffff8106a73a>] dequeue_task+0x5a/0x80
[  457.702331]  [<ffffffff8106adfe>] deactivate_task+0x1e/0x20
[  457.702334]  [<ffffffff817a5239>] __schedule+0x3a9/0x6a0
[  457.702336]  [<ffffffff817a5554>] schedule+0x24/0x70
[  457.702339]  [<ffffffff8105cc53>] worker_thread+0x1c3/0x370
[  457.702340]  [<ffffffff8105ca90>] ? manage_workers.isra.26+0x2a0/0x2a0
[  457.702342]  [<ffffffff81062fc4>] kthread+0xc4/0xe0
[  457.702344]  [<ffffffff81062f00>] ? flush_kthread_worker+0x70/0x70
[  457.702345]  [<ffffffff817a8fcc>] ret_from_fork+0x7c/0xb0
[  457.702346]  [<ffffffff81062f00>] ? flush_kthread_worker+0x70/0x70
[  457.702358] Code: 0f 1f 80 00 00 00 00 49 8b 87 a8 00 00 00 48 8b 80 30 08 00 00 eb ba 0f 1f 84 00 00 00 00 00 48 8b 45 c8 48 8b 4d c8 83 68 04 01 <48> 8b 80 30 08 00 00 48 89 c2 48 2b 91 68 09 00 00 0f 88 75 03 
[  457.702359] NMI backtrace for cpu 1
[  457.702361] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.13.5-64bit #2
[  457.702361] Hardware name: Dell Inc. PowerEdge R210/0M877N, BIOS 1.1.4 11/16/2009
[  457.702362] task: ffff88003f1f3cc0 ti: ffff88003f214000 task.ti: ffff88003f214000
[  457.702367] RIP: 0010:[<ffffffff812f4d75>]  [<ffffffff812f4d75>] delay_tsc+0x45/0x80
[  457.702368] RSP: 0018:ffff88003f483ce8  EFLAGS: 00000046
[  457.702369] RAX: 00000133cef2efb6 RBX: 0000000000000001 RCX: 0000000000000082
[  457.702369] RDX: 0000000000000057 RSI: 0000000000000002 RDI: 00000000002487e6
[  457.702370] RBP: ffff88003f483d08 R08: 0000000000000e81 R09: ffffffff81c85198
[  457.702371] R10: 0000000000000018 R11: 0000000000040000 R12: 00000000cef2ef5f
[  457.702372] R13: 00000000002487e6 R14: 0000000000000001 R15: 0000000000000001
[  457.702373] FS:  0000000000000000(0000) GS:ffff88003f480000(0000) knlGS:0000000000000000
[  457.702373] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  457.702374] CR2: 00000000080f2c58 CR3: 000000002abd6000 CR4: 00000000000007e0
[  457.702375] Stack:
[  457.702376]  0000000000002710 ffffffff81af50c0 ffffffff81b48ae0 ffffffff81af50c0
[  457.702377]  ffff88003f483d18 ffffffff812f4cc9 ffff88003f483d38 ffffffff8102f02e
[  457.702378]  000000000000174b ffff88003f48d4e0 ffff88003f483d98 ffffffff81087b90
[  457.702379] Call Trace:
[  457.702380]  <IRQ> 
[  457.702382]  [<ffffffff812f4cc9>] __const_udelay+0x29/0x30
[  457.702385]  [<ffffffff8102f02e>] arch_trigger_all_cpu_backtrace+0x5e/0x80
[  457.702388]  [<ffffffff81087b90>] rcu_check_callbacks+0x5a0/0x5c0
[  457.702391]  [<ffffffff810509b3>] update_process_times+0x43/0x80
[  457.702394]  [<ffffffff81090d31>] tick_sched_handle.isra.13+0x31/0x40
[  457.702395]  [<ffffffff81090e74>] tick_sched_timer+0x44/0x70
[  457.702398]  [<ffffffff810657af>] __run_hrtimer.isra.30+0x4f/0xd0
[  457.702400]  [<ffffffff81065f85>] hrtimer_interrupt+0xf5/0x230
[  457.702403]  [<ffffffff81032ef1>] hpet_interrupt_handler+0x11/0x30
[  457.702404]  [<ffffffff8107f4f3>] handle_irq_event_percpu+0x43/0x160
[  457.702410]  [<ffffffff8107f64c>] handle_irq_event+0x3c/0x60
[  457.702412]  [<ffffffff81081c5f>] handle_edge_irq+0x6f/0x110
[  457.702414]  [<ffffffff8100491d>] handle_irq+0x1d/0x30
[  457.702415]  [<ffffffff81004725>] do_IRQ+0x55/0xd0
[  457.702418]  [<ffffffff817a88ed>] common_interrupt+0x6d/0x6d
[  457.702419]  <EOI> 
[  457.702421]  [<ffffffff81686c79>] ? cpuidle_enter_state+0x59/0xe0
[  457.702423]  [<ffffffff81686c72>] ? cpuidle_enter_state+0x52/0xe0
[  457.702424]  [<ffffffff81686d9a>] cpuidle_idle_call+0x9a/0x140
[  457.702427]  [<ffffffff8100bbe9>] arch_cpu_idle+0x9/0x30
[  457.702428]  [<ffffffff8107eaa1>] cpu_startup_entry+0x71/0x1a0
[  457.702431]  [<ffffffff8102c3a0>] start_secondary+0x1a0/0x1f0
[  457.702443] Code: 0f ae e8 e8 8e 53 d1 ff 66 90 41 89 c4 eb 16 0f 1f 80 00 00 00 00 f3 90 65 8b 1c 25 1c b0 00 00 41 39 de 75 20 66 66 90 0f ae e8 <e8> 66 53 d1 ff 66 90 89 c2 44 29 e2 44 39 ea 72 da 5b 41 5c 41 
[  457.702444] NMI backtrace for cpu 2
[  457.702446] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 3.13.5-64bit #2
[  457.702446] Hardware name: Dell Inc. PowerEdge R210/0M877N, BIOS 1.1.4 11/16/2009
[  457.702447] task: ffff88003f1f4380 ti: ffff88003f216000 task.ti: ffff88003f216000
[  457.702451] RIP: 0010:[<ffffffff812e93e0>]  [<ffffffff812e93e0>] cpumask_next_and+0x10/0x50
[  457.702451] RSP: 0018:ffff88003f503bb8  EFLAGS: 00000202
[  457.702452] RAX: 0000000000000001 RBX: ffff88003f0046f8 RCX: ffff88003f1f3cc0
[  457.702453] RDX: ffff88003f50d130 RSI: ffff88003f0046f8 RDI: 0000000000000001
[  457.702453] RBP: ffff88003f503bc8 R08: 0000000000000000 R09: 0000000000000000
[  457.702454] R10: 0000000000000400 R11: 0000000000000000 R12: ffff88003f50d130
[  457.702455] R13: ffff88003f0046e0 R14: 0000000000000001 R15: ffff88003f503db0
[  457.702456] FS:  0000000000000000(0000) GS:ffff88003f500000(0000) knlGS:0000000000000000
[  457.702457] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  457.702457] CR2: 00000000f763688f CR3: 0000000001acd000 CR4: 00000000000007e0
[  457.702458] Stack:
[  457.702459]  0000000000000000 ffff88003f503c10 ffff88003f503d28 ffffffff810748ab
[  457.702460]  ffff88003f503c68 000000008173c358 fffffffffffffff8 0000000000000000
[  457.702462]  ffff88003f0046f8 ffff88003f503c50 ffff88003f503c28 0000000000000000
[  457.702462] Call Trace:
[  457.702463]  <IRQ> 
[  457.702465]  [<ffffffff810748ab>] find_busiest_group+0x12b/0x890
[  457.702467]  [<ffffffff8107516a>] load_balance+0x15a/0x8a0
[  457.702469]  [<ffffffff8106f665>] ? sched_clock_cpu+0xc5/0x120
[  457.702471]  [<ffffffff8106a6bb>] ? update_rq_clock+0x2b/0x50
[  457.702472]  [<ffffffff81075a01>] rebalance_domains+0x151/0x250
[  457.702474]  [<ffffffff81075b45>] run_rebalance_domains+0x45/0x180
[  457.702475]  [<ffffffff8104a0a5>] __do_softirq+0xd5/0x1d0
[  457.702479]  [<ffffffff8104a475>] irq_exit+0x95/0xa0
[  457.702480]  [<ffffffff8100472e>] do_IRQ+0x5e/0xd0
[  457.702482]  [<ffffffff817a88ed>] common_interrupt+0x6d/0x6d
[  457.702483]  <EOI> 
[  457.702485]  [<ffffffff81686c79>] ? cpuidle_enter_state+0x59/0xe0
[  457.702486]  [<ffffffff81686c72>] ? cpuidle_enter_state+0x52/0xe0
[  457.702488]  [<ffffffff81686d9a>] cpuidle_idle_call+0x9a/0x140
[  457.702490]  [<ffffffff8100bbe9>] arch_cpu_idle+0x9/0x30
[  457.702492]  [<ffffffff8107eaa1>] cpu_startup_entry+0x71/0x1a0
[  457.702494]  [<ffffffff8102c3a0>] start_secondary+0x1a0/0x1f0
[  457.702506] Code: 89 65 00 48 8b 45 d8 48 83 c4 10 5b 41 5c 41 5d 41 5e 5d c3 45 31 e4 eb e6 90 90 55 48 89 e5 41 54 49 89 d4 53 48 89 f3 0f 1f 00 <8d> 4f 01 be 40 00 00 00 48 89 df 48 63 d1 e8 cd 13 01 00 3b 05 
[  457.702506] NMI backtrace for cpu 0
[  457.702508] CPU: 0 PID: 29356 Comm: kworker/u8:12 Not tainted 3.13.5-64bit #2
[  457.702509] Hardware name: Dell Inc. PowerEdge R210/0M877N, BIOS 1.1.4 11/16/2009
[  457.702512] Workqueue: events_unbound async_run_entry_fn
[  457.702513] task: ffff8800001e4380 ti: ffff88002f25c000 task.ti: ffff88002f25c000
[  457.702516] RIP: 0010:[<ffffffff81388820>]  [<ffffffff81388820>] io_serial_in+0x10/0x20
[  457.702517] RSP: 0018:ffff88002f25d968  EFLAGS: 00000002
[  457.702518] RAX: 00000133cef2e800 RBX: ffffffff81cc7d60 RCX: 0000000000000000
[  457.702519] RDX: 00000000000003fd RSI: 00000000000003fd RDI: ffffffff81cc7d60
[  457.702519] RBP: ffff88002f25d968 R08: 000000000000000a R09: 0000000000000000
[  457.702520] R10: 0000000000000000 R11: 0000000000000e7c R12: 0000000000002622
[  457.702521] R13: 0000000000000020 R14: ffffffff81cb1e15 R15: 0000000000000035
[  457.702522] FS:  0000000000000000(0000) GS:ffff88003f400000(0000) knlGS:0000000000000000
[  457.702523] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  457.702523] CR2: 00000000080ee9b4 CR3: 0000000001acd000 CR4: 00000000000007f0
[  457.702524] Stack:
[  457.702525]  ffff88002f25d998 ffffffff81388d1b ffff88002f25d9a8 ffffffff81cc7d60
[  457.702526]  0000000000000074 ffffffff81cc7d60 ffff88002f25d9b8 ffffffff81388da0
[  457.702528]  ffffffff81cb1e05 ffffffff81388d80 ffff88002f25d9e8 ffffffff81384742
[  457.702528] Call Trace:
[  457.702530]  [<ffffffff81388d1b>] wait_for_xmitr+0x3b/0xa0
[  457.702532]  [<ffffffff81388da0>] serial8250_console_putchar+0x20/0x40
[  457.702534]  [<ffffffff81388d80>] ? wait_for_xmitr+0xa0/0xa0
[  457.702535]  [<ffffffff81384742>] uart_console_write+0x32/0x70
[  457.702537]  [<ffffffff81389ad1>] serial8250_console_write+0xb1/0x140
[  457.702539]  [<ffffffff8107bf73>] call_console_drivers.constprop.25+0x93/0xb0
[  457.702541]  [<ffffffff8107d2c8>] console_unlock+0x398/0x3d0
[  457.702542]  [<ffffffff8107d665>] vprintk_emit+0x1e5/0x480
[  457.702544]  [<ffffffff812f19d2>] ? put_dec+0x72/0x90
[  457.702546]  [<ffffffff8139ea97>] dev_vprintk_emit+0x57/0x70
[  457.702548]  [<ffffffff812f256e>] ? string.isra.3+0x3e/0xd0
[  457.702550]  [<ffffffff812f3a49>] ? vsnprintf+0x309/0x5f0
[  457.702551]  [<ffffffff8139eae4>] dev_printk_emit+0x34/0x40
[  457.702553]  [<ffffffff8139f309>] __dev_printk+0x59/0x90
[  457.702554]  [<ffffffff8139f620>] dev_printk+0x40/0x50
[  457.702557]  [<ffffffff8142d910>] sd_revalidate_disk+0x580/0x1c30
[  457.702559]  [<ffffffff8106aeb5>] ? check_preempt_curr+0x85/0xa0
[  457.702560]  [<ffffffff8142f085>] sd_probe_async+0xc5/0x1d0
[  457.702562]  [<ffffffff810681b6>] async_run_entry_fn+0x36/0x130
[  457.702564]  [<ffffffff8105b50f>] process_one_work+0x14f/0x3e0
[  457.702565]  [<ffffffff8105cba9>] worker_thread+0x119/0x370
[  457.702567]  [<ffffffff8105ca90>] ? manage_workers.isra.26+0x2a0/0x2a0
[  457.702569]  [<ffffffff81062fc4>] kthread+0xc4/0xe0
[  457.702570]  [<ffffffff81062f00>] ? flush_kthread_worker+0x70/0x70
[  457.702572]  [<ffffffff817a8fcc>] ret_from_fork+0x7c/0xb0
[  457.702573]  [<ffffffff81062f00>] ? flush_kthread_worker+0x70/0x70
[  457.702585] Code: 48 89 e5 d3 e6 48 63 f6 48 03 77 10 8b 06 5d c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f b6 4f 61 55 48 89 e5 d3 e6 03 77 08 89 f2 ec <0f> b6 c0 5d c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f b6 4f 61 55
Comment 4 Alex 2014-02-28 13:09:44 UTC
Kernel 3.13.1 is fine too.
Comment 5 Alex 2014-02-28 16:40:34 UTC
This bug can be "easily" reproduced by having two machines with one Fibre Channel card each.
Install scst on one machine (as target) and connect the other machine to it (via two cables).
Then add 20 LUNs and export them to the other machine via both ports (so the crashing machine should get 40 new LUNs) and rescan the SCSI bus.
(
echo 1 > /sys/class/fc_host/host<NR>/issue_lip
echo "- - -" > /sys/class/scsi_host/host<NR>/scan
)
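
The two commands above can be wrapped in a loop so that <NR> does not have to be substituted by hand. This is only a sketch: the function name rescan_fc_hosts and its optional sysfs-root argument are made up for illustration (the argument exists so the loop can be tried against a scratch directory), and it assumes every FC host has a matching entry under scsi_host.

```shell
# rescan_fc_hosts [SYSFS_ROOT]
# Issue a LIP on every FC host, then rescan the matching SCSI host.
# SYSFS_ROOT defaults to /sys; any other value is for dry runs only.
rescan_fc_hosts() {
    sysfs_root="${1:-/sys}"
    for fc in "$sysfs_root"/class/fc_host/host*; do
        # Skip the unexpanded pattern when no FC host is present.
        [ -e "$fc/issue_lip" ] || continue
        host="${fc##*/}"
        echo 1 > "$fc/issue_lip"
        scan="$sysfs_root/class/scsi_host/$host/scan"
        if [ -e "$scan" ]; then
            # "- - -" means: all channels, all targets, all LUNs.
            echo "- - -" > "$scan"
        fi
    done
}
```

Called as root without an argument, it acts on /sys. Since a LIP re-initializes the link on each port, it may be preferable on a machine carrying live I/O to write only to the scan files.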

BTW: I have multipath-tools installed also.
Comment 6 Saurav Kashyap 2014-02-28 19:37:36 UTC
Hi Alex,
Try "git bisect", it will help in finding the patch that introduced this problem.

Thanks,
~Saurav
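
A minimal sketch of that workflow, assuming a kernel git checkout and taking the two endpoints as arguments. The helper name start_bisect is made up for illustration, and the tags in the usage note (v3.9 as good, v3.10 as bad) are an assumption based on the versions reported in this bug so far (3.9.11 fine, 3.10.25 bad), not something confirmed here.

```shell
# start_bisect GOOD BAD
# Begin a bisect between a known-good and a known-bad revision.
# git then checks out a commit roughly halfway between the two.
start_bisect() {
    git bisect start &&
    git bisect bad "$2" &&
    git bisect good "$1"
}

# After building, booting and testing the checked-out kernel, mark it
# with one of the following and repeat until git names the first bad commit:
#   git bisect good
#   git bisect bad
# Finally, return to the original branch:
#   git bisect reset
```

For example: start_bisect v3.9 v3.10. Each step needs a kernel build, a reboot and the 40-LUN rescan, so the result is only as reliable as the good/bad verdict given at every step.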
Comment 7 Alex 2014-03-05 15:56:21 UTC
ea461abf61753b4b79e625a7c20650105b990f21 is the first bad commit
commit ea461abf61753b4b79e625a7c20650105b990f21
Author: Gavin Shan <shangw@linux.vnet.ibm.com>
Date:   Wed Jun 5 15:34:02 2013 +0800

    powerpc/eeh: Fix fetching bus for single-dev-PE
    
    While running Linux as guest on top of phyp, we possiblly have
    PE that includes single PCI device. However, we didn't return
    its PCI bus correctly and it leads to failure on recovery from
    EEH errors for single-dev-PE. The patch fixes the issue.
    
    Cc: <stable@vger.kernel.org> # v3.7+
    Cc: Steve Best <sbest@us.ibm.com>
    Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
    Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>

:040000 040000 8af694ef3a1cc027bef45c99ec9a5592a501e31d 6b6f5b9268e71c63a46385d273656b2419645502 M	arch
Comment 8 Alex 2014-03-06 09:40:04 UTC
If this patch is applied, I get the error.
I am still puzzled, because this patch does not look like it should be relevant.
But before this patch everything works, and at the next git bisect step (this patch) I get the errors.
Comment 9 Alex 2014-03-19 10:01:26 UTC
Kernel 3.11.10 is working fine too.
So only 3.10 and 3.12 have this problem (regardless of the patch level).
Comment 10 Alex 2014-03-31 09:57:21 UTC
The new kernel 3.14 works fine too.
So 3.10 and 3.12 have this problem (and my problem is that I must use the 3.10 line).
Comment 11 Alex 2014-04-09 10:22:24 UTC
Does anybody have a hint on how I can solve this problem?
It occurs with any server and any storage with kernel 3.10 and kernel 3.12.

Bisecting has not given me any logical hint as to which patch makes it crash or which patch makes it stop crashing.
Comment 12 Hannes Reinecke 2014-04-09 17:50:49 UTC
Comment #3 is the well-known printk lockup.

Try increasing the speed of the serial console.
Alternatively, try the patch by Jan Kara, which should have been posted to LKML.

Nothing to do with SCSI, except for the number of messages printed.
Comment 13 Alex 2014-04-11 08:20:21 UTC
Thanks for these tips.
It looks like disabling the serial console helps.

But I need the serial console.
I will try to get his patches working so that I can enable the serial console again.
Comment 14 Alex 2014-04-11 13:15:11 UTC
My Grub entry is:

title           Test
root            (hd0,0)
kernel          /boot/vmlinuz-3.10.25 root=/dev/sdb2 ro console=ttyS0
boot

Removing "console=ttyS0" will fix this problem.
But this isn't a solution for me.
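
Two things may be worth trying before giving up the serial console, following the advice in comment 12. First, the entry above sets no baud rate, and the kernel's documented default for console=ttyS0 is 9600n8; writing console=ttyS0,115200n8 on the kernel line raises it, provided the port and the terminal on the other end support that speed. Second, the console loglevel can be lowered so that fewer messages reach the serial line at all. The sketch below covers the second point; the helper name console_loglevel is made up for illustration, and whether level 4 filters out enough of the SCSI probe output on this system is an assumption.

```shell
# console_loglevel [FILE]
# Print the current console loglevel, which is the first field of the
# printk sysctl. FILE defaults to /proc/sys/kernel/printk; passing another
# path is only meant for trying the function out.
console_loglevel() {
    read -r level rest < "${1:-/proc/sys/kernel/printk}"
    echo "$level"
}

# To lower the level at runtime (as root), so that only messages more
# severe than KERN_WARNING (4) are written to the console:
#   dmesg -n 4
# The same value can be set at boot with loglevel=4 on the kernel line.
```

Messages filtered this way are still kept in the kernel log buffer and can be read with dmesg, so nothing is lost for later diagnosis.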
Comment 15 Alex 2014-04-11 13:49:51 UTC
After disabling ACPI Processor and cpuidle, I repeatedly get these four different stack traces (the system keeps working but, for example, no longer reboots):


Call Trace:
 [<ffffffff8175bafd>] __mutex_lock_slowpath+0x5d/0x1d0
 [<ffffffff8175bc48>] ? __mutex_lock_slowpath+0x1a8/0x1d0
 [<ffffffff8175b57d>] mutex_lock+0x1d/0x40
 [<ffffffff81152645>] sysfs_dentry_revalidate+0x35/0x100
 [<ffffffff810fa729>] lookup_fast+0x299/0x2e0
 [<ffffffff810fab06>] ? __inode_permission+0x46/0x70
 [<ffffffff810fad1a>] link_path_walk+0x19a/0x800
 [<ffffffff81027fd9>] ? physflat_send_IPI_mask+0x9/0x10
 [<ffffffff810fe604>] path_openat.isra.68+0x94/0x450
 [<ffffffff8175bc48>] ? __mutex_lock_slowpath+0x1a8/0x1d0
 [<ffffffff810febec>] do_filp_open+0x3c/0x90
 [<ffffffff8110a332>] ? __alloc_fd+0x42/0x100
 [<ffffffff810eefef>] do_sys_open+0xef/0x1d0
 [<ffffffff81002441>] ? do_notify_resume+0x51/0x80
 [<ffffffff81135ae6>] compat_SyS_open+0x16/0x20
 [<ffffffff81760318>] sysenter_dispatch+0x7/0x25

Call Trace:
 [<ffffffff8100a7c9>] default_idle+0x9/0x10
 [<ffffffff8100af4a>] arch_cpu_idle+0xa/0x10
 [<ffffffff81071eb1>] cpu_startup_entry+0x91/0x180
 [<ffffffff81cd9ef7>] start_secondary+0x1a0/0x1a4

Call Trace:
 [<ffffffff81112242>] ? simple_read_from_buffer+0x42/0xa0
 [<ffffffff81151b73>] sysfs_read_file+0xe3/0x190
 [<ffffffff810f02e4>] vfs_read+0xa4/0x180
 [<ffffffff810f055d>] SyS_read+0x4d/0x90
 [<ffffffff81760318>] sysenter_dispatch+0x7/0x25

Call Trace:
 [<ffffffff8175bafd>] __mutex_lock_slowpath+0x5d/0x1d0
 [<ffffffff8175b57d>] mutex_lock+0x1d/0x40
 [<ffffffff811511db>] sysfs_getattr+0x2b/0x60
 [<ffffffff810f45e4>] vfs_getattr+0x24/0x40
 [<ffffffff810f4b88>] vfs_fstat+0x38/0x70
 [<ffffffff8110a332>] ? __alloc_fd+0x42/0x100
 [<ffffffff81036385>] sys32_fstat64+0x15/0x30
 [<ffffffff810ef064>] ? do_sys_open+0x164/0x1d0
 [<ffffffff81760318>] sysenter_dispatch+0x7/0x25


I booted this multiprocessor server with 40 FC-LUNs connected (80 with multipathing).
