Bug 199003

Summary: console stalled, causing a hard LOCKUP.
Product: IO/Storage Reporter: Wen Yang (wenyang)
Component: SCSI    Assignee: io_other
Status: NEW ---    
Severity: high CC: bvanassche, jack, jejb, jiang.biao2, linux-scsi, martin.petersen, pmladek, rostedt, sergey.senozhatsky, tj, wenyang
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: all Tree: Mainline
Regression: No
Attachments: vmcore-dmesg
runq
foreach_bt

Description Wen Yang 2018-03-05 07:47:52 UTC
Watchdog detected hard LOCKUP on cpu 16:
[  557.369547] Kernel panic - not syncing: Hard LOCKUP
[  557.369548] CPU: 16 PID: 11511 Comm: kubelet Tainted: G        W  OE  ------------ T 3.10.0-514.26.2.el7.x86_64 #1
[  557.369548] Hardware name: ZTE EC600G3/XH20G2, BIOS UBF03.11.67_SVN65859 12/29/2016
[  557.369549]  ffffffff818daafa 0000000061eaf60c ffff88407d485b18 ffffffff816873ff
[  557.369550]  ffff88407d485b98 ffffffff816807ea 0000000b00000010 ffff88407d485ba8
[  557.369551]  ffff88407d485b48 0000000061eaf60c ffff88407d485b88 ffffffff818da800
[  557.369551] Call Trace:
[  557.369552]  <NMI>  [<ffffffff816873ff>] dump_stack+0x19/0x1b
[  557.369555]  [<ffffffff816807ea>] panic+0xe3/0x1f2
[  557.369558]  [<ffffffff81085abf>] nmi_panic+0x3f/0x40
[  557.369559]  [<ffffffff8112f999>] watchdog_overflow_callback+0xf9/0x100
[  557.369560]  [<ffffffff81174e4e>] __perf_event_overflow+0x8e/0x1f0
[  557.369562]  [<ffffffff81175a94>] perf_event_overflow+0x14/0x20
[  557.369563]  [<ffffffff81009d88>] intel_pmu_handle_irq+0x1f8/0x4e0
[  557.369564]  [<ffffffff8131ab2c>] ? ioremap_page_range+0x27c/0x3e0
[  557.369565]  [<ffffffff811bf774>] ? vunmap_page_range+0x1c4/0x310
[  557.369566]  [<ffffffff811bf8d1>] ? unmap_kernel_range_noflush+0x11/0x20
[  557.369567]  [<ffffffff813ca434>] ? ghes_copy_tofrom_phys+0x124/0x210
[  557.369568]  [<ffffffff813ca5c0>] ? ghes_read_estatus+0xa0/0x190
[  557.369569]  [<ffffffff8168f02b>] perf_event_nmi_handler+0x2b/0x50
[  557.369571]  [<ffffffff81690477>] nmi_handle.isra.0+0x87/0x160
[  557.369572]  [<ffffffff8130f380>] ? cfqg_print_leaf_weight_device+0x40/0x40
[  557.369573]  [<ffffffff81690723>] do_nmi+0x1d3/0x410
[  557.369574]  [<ffffffff8130f380>] ? cfqg_print_leaf_weight_device+0x40/0x40
[  557.369575]  [<ffffffff8168f893>] end_repeat_nmi+0x1e/0x2e
[  557.369576]  [<ffffffff8130f380>] ? cfqg_print_leaf_weight_device+0x40/0x40
[  557.369577]  [<ffffffff8168ee6a>] ? _raw_spin_lock_irq+0x3a/0x60
[  557.369577]  [<ffffffff8168ee6a>] ? _raw_spin_lock_irq+0x3a/0x60
[  557.369579]  [<ffffffff8168ee6a>] ? _raw_spin_lock_irq+0x3a/0x60
[  557.369580]  <<EOE>>  [<ffffffff81309336>] blkcg_print_blkgs+0x76/0xf0   ---> waits for blkg->q->queue_lock
[  557.369581]  [<ffffffff8130f236>] cfqg_print_rwstat_recursive+0x36/0x40
[  557.369583]  [<ffffffff81109393>] cgroup_seqfile_show+0x73/0x80
[  557.369584]  [<ffffffff81222b57>] ? seq_buf_alloc+0x17/0x40
[  557.369585]  [<ffffffff8122305a>] seq_read+0x10a/0x3b0
[  557.369586]  [<ffffffff811fe9be>] vfs_read+0x9e/0x170
[  557.369587]  [<ffffffff811ff58f>] SyS_read+0x7f/0xe0
[  557.369588]  [<ffffffff81697ac9>] system_call_fastpath+0x16/0x1b

source code:
void blkcg_print_blkgs(struct seq_file *sf, struct blkcg *blkcg,
		       u64 (*prfill)(struct seq_file *,
				     struct blkg_policy_data *, int),
		       const struct blkcg_policy *pol, int data,
		       bool show_total)
{
	struct blkcg_gq *blkg;
	u64 total = 0;

	rcu_read_lock();
	hlist_for_each_entry_rcu(blkg, &blkcg->blkg_list, blkcg_node) {
		spin_lock_irq(blkg->q->queue_lock);
		...
	}
	...
}
	

Another process, "kworker/u113:1", acquired q->queue_lock and is waiting for the console write to finish:

PID: 319    TASK: ffff881ffb09edd0  CPU: 7   COMMAND: "kworker/u113:1"
 #0 [ffff881fffbc5e48] crash_nmi_callback at ffffffff8104d342
 #1 [ffff881fffbc5e58] nmi_handle at ffffffff81690477
 #2 [ffff881fffbc5eb0] do_nmi at ffffffff81690683
 #3 [ffff881fffbc5ef0] end_repeat_nmi at ffffffff8168f893
    [exception RIP: delay_tsc+68]
    RIP: ffffffff81326724  RSP: ffff881ffb0b7540  RFLAGS: 00000046
    RAX: 0000000000000007  RBX: ffffffff81f89080  RCX: 0066a7d6eb5cfc9d
    RDX: 0066a7d6eb5d0629  RSI: 0000000000000007  RDI: 0000000000000a23
    RBP: ffff881ffb0b7540   R8: 0000000000000000   R9: 00000000ffffffff
    R10: 00000000000003ff  R11: 0000000000000001  R12: 00000000000026f3
    R13: 0000000000000020  R14: ffffffff81d3c0fb  R15: 000000000000003b
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #4 [ffff881ffb0b7540] delay_tsc at ffffffff81326724
 #5 [ffff881ffb0b7548] __const_udelay at ffffffff81326678
 #6 [ffff881ffb0b7558] wait_for_xmitr at ffffffff814056e0
 #7 [ffff881ffb0b7580] serial8250_console_putchar at ffffffff814058ac
 #8 [ffff881ffb0b75a0] uart_console_write at ffffffff8140035a
 #9 [ffff881ffb0b75d0] serial8250_console_write at ffffffff814057fe
#10 [ffff881ffb0b7618] call_console_drivers.constprop.17 at ffffffff81087011
#11 [ffff881ffb0b7640] console_unlock at ffffffff810889e9
#12 [ffff881ffb0b7680] vprintk_emit at ffffffff81088df4
#13 [ffff881ffb0b76f0] dev_vprintk_emit at ffffffff81428e72
#14 [ffff881ffb0b77a8] dev_printk_emit at ffffffff81428eee
#15 [ffff881ffb0b7808] __dev_printk at ffffffff8142937e
#16 [ffff881ffb0b7818] dev_printk at ffffffff8142942d
#17 [ffff881ffb0b7888] sdev_prefix_printk at ffffffff81463771
#18 [ffff881ffb0b7918] scsi_prep_state_check at ffffffff814598e4
#19 [ffff881ffb0b7928] scsi_prep_fn at ffffffff8145992d
#20 [ffff881ffb0b7960] blk_peek_request at ffffffff812f0826
#21 [ffff881ffb0b7988] scsi_request_fn at ffffffff8145b588
#22 [ffff881ffb0b79f0] __blk_run_queue at ffffffff812ebd63
#23 [ffff881ffb0b7a08] blk_queue_bio at ffffffff812f1013        ---> acquired q->queue_lock and is waiting for the console write to finish
#24 [ffff881ffb0b7a50] generic_make_request at ffffffff812ef209
#25 [ffff881ffb0b7a98] submit_bio at ffffffff812ef351
#26 [ffff881ffb0b7af0] xfs_submit_ioend_bio at ffffffffa0146a63 [xfs]
#27 [ffff881ffb0b7b00] xfs_submit_ioend at ffffffffa0146b31 [xfs]
#28 [ffff881ffb0b7b40] xfs_vm_writepages at ffffffffa0146e18 [xfs]
#29 [ffff881ffb0b7bb8] do_writepages at ffffffff8118da6e
#30 [ffff881ffb0b7bc8] __writeback_single_inode at ffffffff812293a0
#31 [ffff881ffb0b7c08] writeback_sb_inodes at ffffffff8122a08e
#32 [ffff881ffb0b7cb0] __writeback_inodes_wb at ffffffff8122a2ef
#33 [ffff881ffb0b7cf8] wb_writeback at ffffffff8122ab33
#34 [ffff881ffb0b7d70] bdi_writeback_workfn at ffffffff8122cb2b
#35 [ffff881ffb0b7e20] process_one_work at ffffffff810a851b
#36 [ffff881ffb0b7e68] worker_thread at ffffffff810a9356
#37 [ffff881ffb0b7ec8] kthread at ffffffff810b0b6f
#38 [ffff881ffb0b7f50] ret_from_fork at ffffffff81697a18

source code:
void blk_queue_bio(struct request_queue *q, struct bio *bio)
{
...
	} else {
		spin_lock_irq(q->queue_lock);
		add_acct_request(q, req, where);
		__blk_run_queue(q);
out_unlock:
		spin_unlock_irq(q->queue_lock);
	}
}

void console_unlock(void)
{
	static char text[LOG_LINE_MAX + PREFIX_MAX];
...
	call_console_drivers(level, text, len);
...
}	
	
I also analyzed this vmcore and found the static buffer text[LOG_LINE_MAX + PREFIX_MAX]:

crash> rd 0xffffffff81d3c0c0 100
ffffffff81d3c0c0:  322e36343520205b 73205d3538313031   [  546.210185] s
ffffffff81d3c0d0:  3a303a313a302064 63656a6572203a30   d 0:1:0:0: rejec
ffffffff81d3c0e0:  4f2f4920676e6974 6c66666f206f7420   ting I/O to offl
ffffffff81d3c0f0:  6976656420656e69 5345525f440a6563   ine device.D_RES
ffffffff81d3c100:  6576697264205445 52443d6574796272   ET driverbyte=DR
ffffffff81d3c110:  0a4b4f5f52455649 286e6f0a4b0a0a65   IVER_OK.e..K.on(
ffffffff81d3c120:  30302e30302e3431 6564690a2930302e   14.00.00.00).ide
ffffffff81d3c130:  656d75636f642064 206e6f697461746e   d documentation
ffffffff81d3c140:  696d696c20726f66 2e736e6f69746174   for limitations.
ffffffff81d3c150:  4c54554e472b200a 2b204c43412b2053   . +GNUTLS +ACL +
ffffffff81d3c160:  20345a4c2d205a58 504d4f434345532d   XZ -LZ4 -SECCOMP
ffffffff81d3c170:  2044494b4c422b20 4c495455464c452b    +BLKID +ELFUTIL
ffffffff81d3c180:  20444f4d4b2b2053 6e650a294e44492b   S +KMOD +IDN).en
ffffffff81d3c190:  2d4654552e53555f 0000000000000a38   _US.UTF-8.......
ffffffff81d3c1a0:  0000000000000000 0000000000000000   ................
ffffffff81d3c1b0:  0000000000000000 0000000000000000   ................
ffffffff81d3c1c0:  0000000000000000 0000000000000000   ................
ffffffff81d3c1d0:  0000000000000000 0000000000000000   ................


So "kworker/u113:1" had been waiting in the console write for:
557 - 546 = 11 seconds
after acquiring q->queue_lock, which caused the hard LOCKUP.
Comment 1 Wen Yang 2018-03-05 07:50:01 UTC
I think this patch may fix it:
https://lkml.org/lkml/2016/4/22/487
Comment 2 sergey.senozhatsky.work 2018-03-05 09:27:22 UTC
Hello,

Let's switch to email.

Cc-ing Petr and Steven


On (03/05/18 07:50), bugzilla-daemon@bugzilla.kernel.org wrote:
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=199003
> 
> yangwen (wen.yang99@zte.com.cn) changed:
> 
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>                  CC|                            |sergey.senozhatsky@gmail.com

Thanks, will take a look.

In the meantime, could you check this message:
  https://marc.info/?l=linux-kernel&m=152016852803716

It lists several commit IDs that you might backport and try out.
Thanks in advance!

	-ss
Comment 3 Wen Yang 2018-03-20 09:34:44 UTC
(In reply to sergey.senozhatsky.work from comment #2)
> Hello,
> 
> Let's switch to email.
> 
> Cc-ing Petr and Steven
> 
> 
> On (03/05/18 07:50), bugzilla-daemon@bugzilla.kernel.org wrote:
> > 
> > https://bugzilla.kernel.org/show_bug.cgi?id=199003
> > 
> > yangwen (wen.yang99@zte.com.cn) changed:
> > 
> >            What    |Removed                     |Added
> >
> ----------------------------------------------------------------------------
> >                  CC|                           
> |sergey.senozhatsky@gmail.com
> 
> Thanks, will take a look.
> 
> In the meantime, could you check this message:
>   https://marc.info/?l=linux-kernel&m=152016852803716
> 
> It lists several commit IDs that you might backport and try out.
> Thanks in advance!
> 
>       -ss

Thanks very much.
commit e480af09c49736848f749a43dff2c902104f6691 avoided the NMI watchdog
trigger.

And this patch may avoid the long blocking:
https://lkml.org/lkml/2018/3/8/584

We've been testing it for several days.
Regards,
Comment 4 sergey.senozhatsky.work 2018-03-21 07:28:59 UTC
On (03/20/18 09:34), bugzilla-daemon@bugzilla.kernel.org wrote:
[..]
> Thanks very much.
> commit e480af09c49736848f749a43dff2c902104f6691 avoided the NMI watchdog
> trigger.

Hm, okay... But "touch_nmi_watchdog() everywhere printk/console-related"
is not exactly where I wanted us to be.

By the way e480af09c49736848f749a43dff2c902104f6691 is from 2006.
Are you sure you meant exactly that commit? What kernel do you use?


Are you saying that none of Steven's patches helped on your setups?


> And this patch may  avdoid long time blocking:
> https://lkml.org/lkml/2018/3/8/584
> 
> We've test it several days.

Hm, printk_deferred is a bit dangerous; it moves console_unlock() to
IRQ. So you still can have the problem of stuck CPUs, it's just now
you shut up the watchdog. Did you test Steven's patches?


A tricky part about printk_deferred() is that it does not use the hand off
mechanism. And even more...  What we have with the "printk vs printk"
scenario

	CPU0			CPU1		...		CPUN

	printk			printk
	 console_unlock 	 hand off			printk
	 			  console_unlock		 hand off
								  console_unlock

turns into a good old "one CPU prints it all" when we have "printk vs
printk_deferred" case. Because printk_deferred just log_store messages
and then _may be_ it grabs the console_sem from IRQ and invokes
console_unlock().

So it's something like this

	CPU0			CPU1		...		CPUN

	printk			printk_deferred
	 console_unlock						printk_deferred
	 console_unlock
	 console_unlock
	...			...				...
				printk_deferred			printk_deferred
	 console_unlock
	 console_unlock


// offtopic  "I can has printk_kthread?"



You now touch_nmi_watchdog() from the console driver [well... at least this
is what e480af09c4973 is doing, but I'm not sure I see how come you didn't
have it applied], so that's why you don't see hard lockups on that CPU0. But
your printing CPU can still get stuck, which will defer RCU work on that CPU,
etc. etc. So I'd say that those two approaches

		printk_deferred + touch_nmi_watchdog

combined can do quite some harm. One thing for sure - they don't really fix
any problems.

	-ss
Comment 5 Wen Yang 2018-03-21 09:24:55 UTC
(In reply to sergey.senozhatsky.work from comment #4)
> On (03/20/18 09:34), bugzilla-daemon@bugzilla.kernel.org wrote:
> [..]
> > Thanks very much.
> > commit e480af09c49736848f749a43dff2c902104f6691 avoided the NMI watchdog
> > trigger.
> 
> Hm, okay... But "touch_nmi_watchdog() everywhere printk/console-related"
> is not exactly where I wanted us to be.
> 
> By the way e480af09c49736848f749a43dff2c902104f6691 is from 2006.
> Are you sure you meant exactly that commit? What kernel do you use?


I'm so sorry, it's this commit:

commit 54f19b4a679149130f78413c421a5780e90a9d0a
Author: Jiri Olsa <jolsa@kernel.org>
Date:   Wed Sep 21 16:43:15 2016 +0200

    tty/serial/8250: Touch NMI watchdog in wait_for_xmitr

    First loop in wait_for_xmitr could also trigger NMI
    watchdog in case reading from the port is slow:

      PID: 0      TASK: ffffffff819c1460  CPU: 0   COMMAND: "swapper/0"
       #0 [ffff88019f405e58] crash_nmi_callback at ffffffff8104d382
       #1 [ffff88019f405e68] nmi_handle at ffffffff8168ead9
       #2 [ffff88019f405eb0] do_nmi at ffffffff8168ec53
       #3 [ffff88019f405ef0] end_repeat_nmi at ffffffff8168df13
          [exception RIP: delay_tsc+50]
          RIP: ffffffff81325642  RSP: ffff88019f403bb0  RFLAGS: 00000083
          RAX: 00000000000005c8  RBX: ffffffff81f83000  RCX: 0000024e4fb88a8b
          RDX: 0000024e4fb89053  RSI: 0000000000000000  RDI: 00000000000007d1
          RBP: ffff88019f403bb0   R8: 000000000000000a   R9: 0000000000000000
          R10: 0000000000000000  R11: ffff88019f403ad6  R12: 000000000000250f
          R13: 0000000000000020  R14: ffffffff81d360c7  R15: 0000000000000047
          ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
      --- <NMI exception stack> ---
       #4 [ffff88019f403bb0] delay_tsc at ffffffff81325642
       #5 [ffff88019f403bb8] __const_udelay at ffffffff813255a8
       #6 [ffff88019f403bc8] wait_for_xmitr at ffffffff81404390
       #7 [ffff88019f403bf0] serial8250_console_putchar at ffffffff8140455c
       #8 [ffff88019f403c10] uart_console_write at ffffffff813ff00a
       #9 [ffff88019f403c40] serial8250_console_write at ffffffff814044ae
      #10 [ffff88019f403c88] call_console_drivers.constprop.15 at ffffffff81086b01
      #11 [ffff88019f403cb0] console_unlock at ffffffff8108842f
      #12 [ffff88019f403ce8] vprintk_emit at ffffffff81088834
      #13 [ffff88019f403d58] vprintk_default at ffffffff81088ba9
      #14 [ffff88019f403d68] printk at ffffffff8167f034

    Adding touch_nmi_watchdog call to the first loop as well. 


> Are you saying that none of Steven's patches helped on your setups?
> 
> 
> > And this patch may  avdoid long time blocking:
> > https://lkml.org/lkml/2018/3/8/584
> > 
> > We've test it several days.
> 
> Hm, printk_deferred is a bit dangerous; it moves console_unlock() to
> IRQ. So you still can have the problem of stuck CPUs, it's just now
> you shut up the watchdog. Did you test Steven's patches?
> 

Steven's patches are helpful for us.
We've backported several of them and will test them soon.

> A tricky part about printk_deferred() is that it does not use hand off
> mechanism. And even more...  What we have with "printk vs printk"
> sceanrio
> 
>       CPU0                    CPU1            ...             CPUN
> 
>       printk                  printk
>        console_unlock          hand off                       printk
>                                 console_unlock                 hand off
>                                                                
> console_unlock
> 
> turns into a good old "one CPU prints it all" when we have "printk vs
> printk_deferred" case. Because printk_deferred just log_store messages
> and then _may be_ it grabs the console_sem from IRQ and invokes
> console_unlock().
> 
> So it's something like this
> 
>       CPU0                    CPU1            ...             CPUN
> 
>       printk                  printk_deffered
>        console_unlock                                         printk_deferred
>        console_unlock
>        console_unlock
>       ...                     ...                             ...
>                               printk_deffered                 printk_deferred
>        console_unlock
>        console_unlock
> 
> 
> // offtopic  "I can has printk_kthread?"
> 
> 
> 
> You now touch_nmi_watchdog() from the console driver [well... at least this
> is what e480af09c4973 is doing, but I'm not sure I see how come you didn't
> have it applied], so that's why you don't see hard lockups on that CPU0. But
> your printing CPU still can stuck, which will defer RCUs on that CPU, etc.
> etc. etc. So I'd say that those two approaches
> 
>               printk_deferred + touch_nmi_watchdog
> 
> combined can do quite some harm. One thing for sure - they don't really fix
> any problems.
> 
>       -ss

Thanks very much. That's right.
But as you know, printk.synchronous works for the whole system; it may cause other problems and needs long-term testing.
So we also need a simple workaround to just avoid this bug.

Regards,
Comment 6 sergey.senozhatsky.work 2018-03-22 02:14:43 UTC
Hello Steven,

On (03/21/18 09:44), Steven Rostedt wrote:
[..]
> >  #4 [ffff881ffb0b7540] delay_tsc at ffffffff81326724
> >  #5 [ffff881ffb0b7548] __const_udelay at ffffffff81326678
> >  #6 [ffff881ffb0b7558] wait_for_xmitr at ffffffff814056e0
> >  #7 [ffff881ffb0b7580] serial8250_console_putchar at ffffffff814058ac
> >  #8 [ffff881ffb0b75a0] uart_console_write at ffffffff8140035a
> >  #9 [ffff881ffb0b75d0] serial8250_console_write at ffffffff814057fe
> > #10 [ffff881ffb0b7618] call_console_drivers.constprop.17 at
> ffffffff81087011
> > #11 [ffff881ffb0b7640] console_unlock at ffffffff810889e9
> > #12 [ffff881ffb0b7680] vprintk_emit at ffffffff81088df4
> > #13 [ffff881ffb0b76f0] dev_vprintk_emit at ffffffff81428e72
> > #14 [ffff881ffb0b77a8] dev_printk_emit at ffffffff81428eee
> > #15 [ffff881ffb0b7808] __dev_printk at ffffffff8142937e
> > #16 [ffff881ffb0b7818] dev_printk at ffffffff8142942d
> > #17 [ffff881ffb0b7888] sdev_prefix_printk at ffffffff81463771
> > #18 [ffff881ffb0b7918] scsi_prep_state_check at ffffffff814598e4
> > #19 [ffff881ffb0b7928] scsi_prep_fn at ffffffff8145992d
> > #20 [ffff881ffb0b7960] blk_peek_request at ffffffff812f0826
> > #21 [ffff881ffb0b7988] scsi_request_fn at ffffffff8145b588
> > #22 [ffff881ffb0b79f0] __blk_run_queue at ffffffff812ebd63
> > #23 [ffff881ffb0b7a08] blk_queue_bio at ffffffff812f1013       
> -----》acquired q->queue_lock and wait for console_write to finish
> 
> What do you mean by "wait for console_write to finish" here?

*My guess*

I suppose a typical "printk->console_unlock() under spin_lock" thing

	CPU0						CPU1
	spin_lock(&lock)
	 printk						spin_lock(&lock)
	  console_unlock()
	   for (;;)
	    call_console_drivers()

While other CPUs do printk->log_store() and some (or maybe just one) are locked
up on spin_lock(&lock). So we have several lockup vectors here - one is
console_unlock() under spin_lock and the other one is CPUs spinning on the
very same spin_lock. So handoff should help here, no doubt. But it cannot help
if other CPUs will start to printk_deferred() instead of printk().


Looking at
 printk()->call_console_drivers()->serial8250_console_putchar()->wait_for_xmitr()

... wait_for_xmitr() can spin for over 1 second waiting for the UART_MSR_CTS
bit.

static void wait_for_xmitr(struct uart_8250_port *up, int bits)
{
	unsigned int status, tmout = 10000;

	/* Wait up to 10ms for the character(s) to be sent. */
	for (;;) {
		status = serial_in(up, UART_LSR);

		up->lsr_saved_flags |= status & LSR_SAVE_FLAGS;

		if ((status & bits) == bits)
			break;
		if (--tmout == 0)
			break;
		udelay(1);
		touch_nmi_watchdog();
	}

	/* Wait up to 1s for flow control if necessary */
	if (up->port.flags & UPF_CONS_FLOW) {
		for (tmout = 1000000; tmout; tmout--) {
			unsigned int msr = serial_in(up, UART_MSR);
			up->msr_saved_flags |= msr & MSR_SAVE_FLAGS;
			if (msr & UART_MSR_CTS)
				break;
			udelay(1);
			touch_nmi_watchdog();
		}
	}
	...
}

a 1+ second long busy loop in the console driver is quite close to
"problems guaranteed". But, wait, there is even more. This wait_for_xmitr()
busy wait is happening after every character we print on the console. So
printk("foo") will generate 5 * wait_for_xmitr() busy loops [foo + \r + \n].
They punch & touch the watchdog a lot, so at least the system won't get killed
by the hardlockup detector. But at the same time, it's still potentially a
1+ second busy loop in the console driver * strlen(message).

Sometimes I really wish we had detached consoles. Direct printk()->console
is nice and cool, but... we can't have it. I don't want to pressure for
printk_kthread, but look at all those consoles. There are not so many ways
out. What do you think?

	-ss
Comment 7 sergey.senozhatsky.work 2018-03-22 02:35:04 UTC
On (03/22/18 11:14), Sergey Senozhatsky wrote:
[..]
> Looking at
> 
>  printk()->call_console_drivers()->serial8250_console_putchar()->wait_for_xmitr()
> 
> ... wait_for_xmitr() can spin for over 1 second waiting for the UART_MSR_CTS
> bit.

[..]

> a 1+ second long busy loop in the console driver is quite close to
> "problems guaranteed". But, wait, there is even more. This wait_for_xmitr()
> busy wait is happening after every character we print on the console. So
> printk("foo") will generate 5 * wait_for_xmitr() busy loops [foo + \r + \n].
> They punch&touch watchdog a lot, so at the least the system won't get killed
> by the hardlockup detector. But at the same time, it's still potentially a
> 1+ second busy loop in the console driver * strlen(message).

One does not even need to have concurrent printk()-s in this case. A
single CPU doing several direct printks under spin_lock is already
enough:

	CPUA						CPUB ~ CPUZ
	spin_lock(&lock)
	 printk->wait_for_xmitr				spin_lock(&lock)
	 printk->wait_for_xmitr
	 ...
	 printk->wait_for_xmitr				<< lockups >>
	 printk->wait_for_xmitr
	spin_unlock(&lock)

> Sometimes I really wish we had detached consoles. Direct printk()->console
> is nice and cool, but... we can't have it.

And this is, basically, what they do with printk_deferred(). We usually
use it to avoid deadlocks, but in this particular case it's used due to
the fact that direct printk() is way too painful, so they are detaching
printout and move it to another control path. Quite an interesting idea,
I must say.

	-ss
Comment 8 sergey.senozhatsky.work 2018-03-23 12:06:29 UTC
Hello Steven,

On (03/22/18 18:25), Steven Rostedt wrote:
> > static void wait_for_xmitr(struct uart_8250_port *up, int bits)
> > {
> >     unsigned int status, tmout = 10000;
> > 
> >     /* Wait up to 10ms for the character(s) to be sent. */
> >     for (;;) {
			....
> >             if (--tmout == 0)
> >                     break;
> >             udelay(1);
> >             touch_nmi_watchdog();
> >     }
> > 
> >     /* Wait up to 1s for flow control if necessary */
> >     if (up->port.flags & UPF_CONS_FLOW) {
> >             for (tmout = 1000000; tmout; tmout--) {
			....
> >                     udelay(1);
> >                     touch_nmi_watchdog();
> >             }
> >     }
> >     ...
> > }
> > 
> > a 1+ second long busy loop in the console driver is quite close to
> 
> Yeah that's nasty but shouldn't ever hit 1 second unless there's
> hardware issues.

Would probably agree. But even without that 1 second UPF_CONS_FLOW loop
(it seems that there are not so many consoles that have that flag set),
we have a 10ms [10000 * udelay(1)] loop after every character. So if we
print a 512-byte kernel message [1/2 of the max printk message length]
then it's 0.01 * (512 + \r + \n) = ~5.14 seconds. Which is a lot, unless I
miscalculated 10000 * udelay(1). Delays in the first loop are normal,
probably not always 10 ms, though, but still.

[..]
> > Sometimes I really wish we had detached consoles. Direct printk()->console
> > is nice and cool, but... we can't have it. I don't want to pressure for
> > printk_kthread, but look at all those consoles. There are not so many ways
> > out. What do you think?
> 
> If anything, perhaps we could have a printk thread that is triggered to
> wake up if any print is taking a long time, or if there's a lot of
> prints happening. And then it could try to print anything in the
> buffer. Just the mere printk() in the thread should do the hand off.

OK. Let's think more about this. I have that "wake up printk_kthread if
any print is taking a long time, or if there's a lot of prints happening"
patch set: a watchdog threshold based solution. I didn't update it for a
while, but can take a look. Detaching printouts does look to me like
something we need to do.

Offloading does not work well enough with hand off (in its current
form). Because the CPU that we offloaded printing from can re-take
console_sem the very next time it calls printk().

Something like

       spin_lock(&lock)
	printk() -> 8250
	 console_unlock()
	 offload		printk_kthread
	printk() -> 8250	 console_unlock()
	printk() -> 8250	  hand off
	 console_unlock()

so if printk() -> 8250 is very slow and is under some spin_lock
(like in case of Bug 199003) then offloading doesn't make that much
difference here. Right? If we don't hand off from printk_kthread
then things do look different.

> I wonder how bad it would be to wake the printk thread whenever a
> printk() is executed. Printks really shouldn't be that common.

That didn't work quite well for me. But I must admit that my previous
printk_kthread version had some major design problems - preemption in
printk path [including printk_kthread preemption], the lack of any
reliable way for any CPU to control what printk_kthread is doing, and
so on and on. Current version, I guess, is better. But I don't do
"vprintk_emit() -> wake_up(printk_kthread)", instead I let "direct"
printks and wake_up(printk_kthread) only when current CPU spent too
much time in call_console_drivers(); it's quite easy to track given
that printk->console_unlock() is preempt_disable() now. Besides,
offloading from console_unlock() automatically handles various cases
for us. Like printk_deferred() -> IRQ -> console_unlock() and so on.

What do you think?

	-ss
Comment 9 Petr Mladek 2018-03-23 13:33:46 UTC
On Fri 2018-03-23 21:06:22, Sergey Senozhatsky wrote:
> > > Sometimes I really wish we had detached consoles. Direct
> printk()->console
> > > is nice and cool, but... we can't have it. I don't want to pressure for
> > > printk_kthread, but look at all those consoles. There are not so many
> ways
> > > out. What do you think?
> > 
> > If anything, perhaps we could have a printk thread that is triggered to
> > wake up if any print is taking a long time, or if there's a lot of
> > prints happening. And then it could try to print anything in the
> > buffer. Just the mere printk() in the thread should do the hand off.
> 
> OK. Let's think more about this. I have that "wake up printk_kthread if
> any print is taking a long time, or if there's a lot of prints happening"
> patch set: a watchdog threshold based solution. I didn't update it for a
> while, but can take a look. Detaching printouts does look to me like
> something we need to do.
> 
> Offloading does not work well enough with hand off (in its current
> form). Because the CPU that we offloaded printing from can re-take
> console_sem the very next time it calls printk().
> 
> Something like
> 
>        spin_lock(&lock)
>       printk() -> 8250
>        console_unlock()
>        offload                printk_kthread
>       printk() -> 8250         console_unlock()
>       printk() -> 8250          hand off
>        console_unlock()
> 
> so if printk() -> 8250 is very slow and is under some spin_lock
> (like in case of Bug 199003) then offloading doesn't make that much
> difference here. Right? If we don't hand off from printk_kthread
> then things do look different.

If we do not offload from printk_kthread, it might cause softlockups.
This might be solved by rescheduling from printk_kthread. The question
is if it is acceptable. I believe so because we still have cond_resched()
in console_unlock().

We should just make sure to release console_sem before rescheduling.
Then any other process might eventually continue flushing the messages.
It would help to reduce losing messages. It might be essential when
panic(), suspend, ... happen when printk_kthread is sleeping.

Sigh, I am afraid that we still might need to disable the offload
in various situations.


> > I wonder how bad it would be to wake the printk thread whenever a
> > printk() is executed. Printks really shouldn't be that common.
> 
> That didn't work quite well for me. But I must admit that my previous
> printk_kthread version had some major design problems - preemption in
> printk path [including printk_kthread preemption], the lack of any
> reliable way for any CPU to control what printk_kthread is doing, and
> so on and on. Current version, I guess, is better. But I don't do
> "vprintk_emit() -> wake_up(printk_kthread)", instead I let "direct"
> printks and wake_up(printk_kthread) only when current CPU spent too
> much time in call_console_drivers(); it's quite easy to track given
> that printk->console_unlock() is preempt_disable() now. Besides,
> offloading from console_unlock() automatically handles various cases
> for use. Like printk_deferred() -> IRQ -> console_unlock() and so on.
> 
> What do you think?

IMHO, the wakeup from console_unlock() and related to watchdog threshold
makes sense.

I am still not sure about the reliable way for CPU control. The more
scheduler code we call the more printk_deferred() calls will be
needed.

Sigh, I am afraid that the patchset will not be trivial. We might
still need some justification for all the changes. If I get it
correctly, the reporter of this bug has not tried Steven's patches
yet.

Best Regards,
Petr
Comment 10 Wen Yang 2018-03-26 02:02:54 UTC
(In reply to sergey.senozhatsky.work from comment #8)
> Hello Steven,

> 
> > I wonder how bad it would be to wake the printk thread whenever a
> > printk() is executed. Printks really shouldn't be that common.
> 
> That didn't work quite well for me. But I must admit that my previous
> printk_kthread version had some major design problems - preemption in
> printk path [including printk_kthread preemption], the lack of any
> reliable way for any CPU to control what printk_kthread is doing, and
> so on and on. Current version, I guess, is better. But I don't do
> "vprintk_emit() -> wake_up(printk_kthread)", instead I let "direct"
> printks and wake_up(printk_kthread) only when current CPU spent too
> much time in call_console_drivers(); it's quite easy to track given
> that printk->console_unlock() is preempt_disable() now. Besides,
> offloading from console_unlock() automatically handles various cases
> for use. Like printk_deferred() -> IRQ -> console_unlock() and so on.
> 
> What do you think?
> 
>       -ss


I'm testing this patch.
Small nit:  we should define printk_kthread_need_flush_console as atomic


+/* When `true' printing thread has messages to print */
+static bool printk_kthread_need_flush_console;

...
+		if (console_loglevel != CONSOLE_LOGLEVEL_MOTORMOUTH &&
+				printk_kthread) {
+			/* Offload printing to a schedulable context. */
+			printk_kthread_need_flush_console = true;
+			wake_up_process(printk_kthread);

...
+	while (1) {
+		set_current_state(TASK_INTERRUPTIBLE);
+		if (!printk_kthread_need_flush_console)
+			schedule();
+
+		__set_current_state(TASK_RUNNING);
+		/*
+		 * Avoid an infinite loop when console_unlock() cannot
+		 * access consoles, e.g. because console_suspended is
+		 * true. schedule(), someone else will print the messages
+		 * from resume_console().
+		 */
+		printk_kthread_need_flush_console = false;
Comment 11 Wen Yang 2018-03-26 02:40:09 UTC
Hello Steven,

+module_param_named(synchronous, printk_sync, bool, S_IRUGO);
+MODULE_PARM_DESC(synchronous, "make printing to console synchronous");

It depends on this kernel parameter (printk.synchronous), but this
parameter is read-only. So we must change the grub files and also
restart the server for the changes to take effect. If we can configure
it dynamically, it will be more useful.
Comment 12 sergey.senozhatsky.work 2018-03-26 05:14:00 UTC
On (03/23/18 14:16), Petr Mladek wrote:
[..]
> If I get it correctly, the reporter of this bug has not tried
> Steven's patches yet.

It's not immediately clear.

It's not even completely clear if we are looking at an "X CPUs printk,
1 CPU prints it all" scenario, and it's not clear if hand off will be
helpful here. I'll try to explain.

What we see is:

CPU0 locked up on blkg->q->queue_lock

  [<ffffffff8130f236>] cfqg_print_rwstat_recursive+0x36/0x40
  [<ffffffff81109393>] cgroup_seqfile_show+0x73/0x80
  [<ffffffff81222b57>] ? seq_buf_alloc+0x17/0x40
  [<ffffffff8122305a>] seq_read+0x10a/0x3b0
  [<ffffffff811fe9be>] vfs_read+0x9e/0x170
  [<ffffffff811ff58f>] SyS_read+0x7f/0xe0
  [<ffffffff81697ac9>] system_call_fastpath+0x16/0x1b

blkg->q->queue_lock was held by CPU7, which was spinning in wait_for_xmitr().

 #5 [ffff881ffb0b7548] __const_udelay at ffffffff81326678
 #6 [ffff881ffb0b7558] wait_for_xmitr at ffffffff814056e0
 #7 [ffff881ffb0b7580] serial8250_console_putchar at ffffffff814058ac
 #8 [ffff881ffb0b75a0] uart_console_write at ffffffff8140035a
 #9 [ffff881ffb0b75d0] serial8250_console_write at ffffffff814057fe
#10 [ffff881ffb0b7618] call_console_drivers.constprop.17 at ffffffff81087011
#11 [ffff881ffb0b7640] console_unlock at ffffffff810889e9
#12 [ffff881ffb0b7680] vprintk_emit at ffffffff81088df4
#13 [ffff881ffb0b76f0] dev_vprintk_emit at ffffffff81428e72
#14 [ffff881ffb0b77a8] dev_printk_emit at ffffffff81428eee
#15 [ffff881ffb0b7808] __dev_printk at ffffffff8142937e
#16 [ffff881ffb0b7818] dev_printk at ffffffff8142942d
#17 [ffff881ffb0b7888] sdev_prefix_printk at ffffffff81463771
#18 [ffff881ffb0b7918] scsi_prep_state_check at ffffffff814598e4
#19 [ffff881ffb0b7928] scsi_prep_fn at ffffffff8145992d
#20 [ffff881ffb0b7960] blk_peek_request at ffffffff812f0826
#21 [ffff881ffb0b7988] scsi_request_fn at ffffffff8145b588
#22 [ffff881ffb0b79f0] __blk_run_queue at ffffffff812ebd63
#23 [ffff881ffb0b7a08] blk_queue_bio at ffffffff812f1013     <<< acquired q->queue_lock
#24 [ffff881ffb0b7a50] generic_make_request at ffffffff812ef209
#25 [ffff881ffb0b7a98] submit_bio at ffffffff812ef351
#26 [ffff881ffb0b7af0] xfs_submit_ioend_bio at ffffffffa0146a63 [xfs]
#27 [ffff881ffb0b7b00] xfs_submit_ioend at ffffffffa0146b31 [xfs]
#28 [ffff881ffb0b7b40] xfs_vm_writepages at ffffffffa0146e18 [xfs]
#29 [ffff881ffb0b7bb8] do_writepages at ffffffff8118da6e
#30 [ffff881ffb0b7bc8] __writeback_single_inode at ffffffff812293a0
#31 [ffff881ffb0b7c08] writeback_sb_inodes at ffffffff8122a08e
#32 [ffff881ffb0b7cb0] __writeback_inodes_wb at ffffffff8122a2ef
#33 [ffff881ffb0b7cf8] wb_writeback at ffffffff8122ab33
#34 [ffff881ffb0b7d70] bdi_writeback_workfn at ffffffff8122cb2b
#35 [ffff881ffb0b7e20] process_one_work at ffffffff810a851b
#36 [ffff881ffb0b7e68] worker_thread at ffffffff810a9356
#37 [ffff881ffb0b7ec8] kthread at ffffffff810b0b6f
#38 [ffff881ffb0b7f50] ret_from_fork at ffffffff81697a18


Given how slow serial8250_console_putchar()->wait_for_xmitr() can be -
10ms of delay for every char - it's possible that we had no concurrent
printk()-s from other CPUs. So maybe we had just one printing CPU,
and several CPUs spinning on a spin_lock which was owned by the printing
CPU.

So that's why printk_deferred() helped here. It simply detached the 8250
and made the spin_lock critical section as fast as printk->log_store().


But here comes the tricky part. Suppose that we:
a) have at least two CPUs that call printk concurrently
b) have hand off enabled


Now, what will happen if we have something like this

CPU0					CPU1				CPU2
					spin_lock(queue_lock)
					 printk				printk
cfqg_print_rwstat_recursive()		  serial8250
 spin_lock(queue_lock)			 printk				 serial8250
 					  serial8250			printk
									 serial8250


I suspect that handoff may not be very helpful. CPU1 and CPU2 will wait for
each other to finish serial8250 and to hand off printing to each other. So CPU1
will do 2 serial8250 invocations to printk its messages, and in between it
will spin waiting for CPU2 to do its printk->serial8250 and to handoff
printing to CPU1. The problem is that CPU1 will be under spin_lock() all
that time, so CPU0 is going to suffer just like before.

Opinions?

	-ss
Comment 13 sergey.senozhatsky.work 2018-03-26 05:18:52 UTC
Cc-ing the kernel list and printk people.

Wen Yang, any chance we can switch to email? Bugzilla is not very handy.


On (03/26/18 02:40), bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=199003
> 
> --- Comment #11 from Wen Yang (wen.yang99@zte.com.cn) ---
> Hello Steven,
> 
> +module_param_named(synchronous, printk_sync, bool, S_IRUGO);
> +MODULE_PARM_DESC(synchronous, "make printing to console synchronous");
> 
> It depends on this kernel parameter (printk.synchronous), but this
> parameter is read-only.
> So we must change the grub files and also restart the server for the
> changes to take effect.
> If we can configure it dynamically, it will be more useful.

So you are testing printk_kthread now, right? May I ask why?
Did Steven's patch help?

	-ss
Comment 14 Petr Mladek 2018-03-26 09:26:48 UTC
On Mon 2018-03-26 14:12:22, Sergey Senozhatsky wrote:
> Given how slow serial8250_console_putchar()->wait_for_xmitr() can be -
> 10ms of delay for every char - it's possible that we had no concurrent
> printk()-s from other CPUs. So maybe we had just one printing CPU,
> and several CPUs spinning on a spin_lock which was owned by the printing
> CPU.
> 
> So that's why printk_deferred() helped here. It simply detached the 8250
> and made the spin_lock critical section as fast as printk->log_store().
>
> But here comes the tricky part. Suppose that we:
> a) have at least two CPUs that call printk concurrently
> b) have hand off enabled
> 
> 
> Now, what will happen if we have something like this
> 
> CPU0                                  CPU1                            CPU2
>                                       spin_lock(queue_lock)
>                                        printk                         printk
> cfqg_print_rwstat_recursive()           serial8250
>  spin_lock(queue_lock)                  printk                          serial8250
>                                          serial8250                   printk
>                                                                        serial8250
> 
> 
> I suspect that handoff may not be very helpful. CPU1 and CPU2 will wait for
> each other to finish serial8250 and to hand off printing to each other. So CPU1
> will do 2 serial8250 invocations to printk its messages, and in between it
> will spin waiting for CPU2 to do its printk->serial8250 and to handoff
> printing to CPU1. The problem is that CPU1 will be under spin_lock() all
> that time, so CPU0 is going to suffer just like before.
> 
> Opinions?

It would help if Wen Yang could provide the entire log and also try
Steven's patches. Otherwise, we are just speculating.

The 10ms delay-per-char looks scary. But if I get it correctly, it
would happen only if we hit the deadline for each character. So,
the question is if the serial port is really that busy and, if so, why.

Also I wonder how many messages were actually printed under the
queue_lock. printk_deferred() seems to help but it is an immediate
offload. I wonder if the offload from console_unlock() would actually
help. It might fail to detect the critical situation if too many lines
are printed inside the lock and console_unlock() is called for each
line separately. Also it would fail when only a single line caused that
big delay.

Best Regards,
Petr
Comment 15 Wen Yang 2018-03-26 10:23:16 UTC
Created attachment 274943 [details]
vmcore-dmesg

vmcore-dmesg
Comment 16 Petr Mladek 2018-03-26 13:05:39 UTC
On Mon 2018-03-26 10:23:16, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=199003
> 
> --- Comment #15 from Wen Yang (wen.yang99@zte.com.cn) ---
> Created attachment 274943 [details]
>   --> https://bugzilla.kernel.org/attachment.cgi?id=274943&action=edit
> vmcore-dmesg

The log looks like:

[  545.616827] sd 0:1:0:0: rejecting I/O to offline device
[  545.622695] sd 0:1:0:0: rejecting I/O to offline device
...
[  546.247802] sd 0:1:0:0: rejecting I/O to offline device
[  546.247816] sd 0:1:0:0: rejecting I/O to offline device
[  546.247892] sd 0:1:0:0: [sda] CDB: Write(10) 2a 00 0c 0d 2a 58 00 00 10 00
[  546.247893] blk_update_request: I/O error, dev sda, sector 202189400
[  552.960713] ------------[ cut here ]------------
[  552.960718] WARNING: at net/sched/sch_generic.c:297 dev_watchdog+0x276/0x280()
[  552.960719] NETDEV WATCHDOG: ens12f0 (ixgbe): transmit queue 8 timed out

It prints more than 3000 times the same line within 1 sec. It is > 190kB:

$> grep "sd 0:1:0:0: rejecting I/O to offline device" /prace/vmcore-dmesg.txt  | wc -l
3239
$> grep "sd 0:1:0:0: rejecting I/O to offline device" /prace/vmcore-dmesg.txt  | wc -c
191101

Therefore the serial console is actually pretty fast. It seems that the deadline
10ms-per-character is not in the game here.

IMHO, printing the same message so many times is useless. Therefore
some throttling would make sense. If we want to keep sdev_printk(),
it might look like:


diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index c84f931388f2..109490eae480 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1287,6 +1287,9 @@ static int
 scsi_prep_state_check(struct scsi_device *sdev, struct request *req)
 {
 	int ret = BLKPREP_OK;
+	static DEFINE_RATELIMIT_STATE(_rs,
+				      DEFAULT_RATELIMIT_INTERVAL,
+				      DEFAULT_RATELIMIT_BURST);
 
 	/*
 	 * If the device is not in running state we will reject some
@@ -1301,8 +1304,9 @@ scsi_prep_state_check(struct scsi_device *sdev, struct request *req)
 			 * commands.  The device must be brought online
 			 * before trying any recovery commands.
 			 */
-			sdev_printk(KERN_ERR, sdev,
-				    "rejecting I/O to offline device\n");
+			if (__ratelimit(&_rs))
+				sdev_printk(KERN_ERR, sdev,
+					    "rejecting I/O to offline device\n");
 			ret = BLKPREP_KILL;
 			break;
 		case SDEV_DEL:
@@ -1802,8 +1806,13 @@ static void scsi_request_fn(struct request_queue *q)
 			break;
 
 		if (unlikely(!scsi_device_online(sdev))) {
-			sdev_printk(KERN_ERR, sdev,
-				    "rejecting I/O to offline device\n");
+			static DEFINE_RATELIMIT_STATE(_rs,
+					DEFAULT_RATELIMIT_INTERVAL,
+					DEFAULT_RATELIMIT_BURST);
+
+			if (__ratelimit(&_rs))
+				sdev_printk(KERN_ERR, sdev,
+					    "rejecting I/O to offline device\n");
 			scsi_kill_request(req, q);
 			continue;
 		}

It is possible that throttling one of the above functions would
be enough. I am just not sure which one.

Another question is if the scsi code is sane. It seems to me that it
does a lot of cycles/code under the spin lock even without the printk
calls. In other words, I wonder if we could spend too much time in this
code even without all the printk's.

Best Regards,
Petr
Comment 17 sergey.senozhatsky.work 2018-03-27 00:57:19 UTC
On (03/26/18 13:05), bugzilla-daemon@bugzilla.kernel.org wrote:
[..]
> It prints more than 3000 times the same line within 1 sec. It is > 190kB:
> 
> $> grep "sd 0:1:0:0: rejecting I/O to offline device" /prace/vmcore-dmesg.txt | wc -l
> 3239
> $> grep "sd 0:1:0:0: rejecting I/O to offline device" /prace/vmcore-dmesg.txt | wc -c
> 191101
> 
> Therefore the serial console is actually pretty fast. It seems that the
> deadline 10ms-per-character is not in the game here.

As the name suggests, this is dmesg - the content of the logbuf. We can't
tell anything about serial console speed from it.

	-ss
Comment 18 sergey.senozhatsky.work 2018-03-27 01:02:43 UTC
On (03/26/18 09:26), bugzilla-daemon@bugzilla.kernel.org wrote:
> I wonder if the offload from console_unlock() would actually help. It
> might fail to detect the critical situation if too many lines are printed
> inside the lock and console_unlock() is called for each line separately.

No. We are in preemption disabled context, there is no way for the current
offloading code to mess up in this situation.

> Also it would fail when only a single line caused that big delay.

Yes.

	-ss
Comment 19 sergey.senozhatsky.work 2018-03-27 01:32:11 UTC
On (03/27/18 00:57), bugzilla-daemon@bugzilla.kernel.org wrote:
> --- Comment #17 from sergey.senozhatsky.work@gmail.com ---
> On (03/26/18 13:05), bugzilla-daemon@bugzilla.kernel.org wrote:
> [..]
> > It prints more than 3000 times the same line within 1 sec. It is > 190kB:
> > 
> > $> grep "sd 0:1:0:0: rejecting I/O to offline device" /prace/vmcore-dmesg.txt | wc -l
> > 3239
> > $> grep "sd 0:1:0:0: rejecting I/O to offline device" /prace/vmcore-dmesg.txt | wc -c
> > 191101
> > 
> > Therefore the serial console is actually pretty fast. It seems that the
> > deadline 10ms-per-character is not in the game here.
> 
> As the name suggests, this is dmesg - the content of the logbuf. We can't
> tell anything about serial console speed from it.

Hm, that dmesg output has backtraces only for CPU16 and CPU8.
Neither CPU16 nor CPU8 was in console_unlock()->serial8250().

	-ss
Comment 20 Wen Yang 2018-03-27 02:59:45 UTC
Created attachment 274951 [details]
runq
Comment 21 Wen Yang 2018-03-27 03:00:32 UTC
Created attachment 274953 [details]
foreach_bt
Comment 22 Petr Mladek 2018-03-27 08:36:34 UTC
On Tue 2018-03-27 00:57:19, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=199003
> 
> --- Comment #17 from sergey.senozhatsky.work@gmail.com ---
> On (03/26/18 13:05), bugzilla-daemon@bugzilla.kernel.org wrote:
> > Therefore the serial console is actually pretty fast. It seems that the
> > deadline 10ms-per-character is not in the game here.
> 
> As the name suggests, this is dmesg - the content of the logbuf. We can't
> tell anything about serial console speed from it.

Grrr, you are right. It would be interesting to see the output from
the serial port as well.

Anyway, it does not change the fact that printing so many identical lines
is useless. The throttling still would make sense and probably would
solve the problem.

Best Regards,
Petr
Comment 23 Petr Mladek 2018-03-27 09:06:34 UTC
On Tue 2018-03-27 01:02:43, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=199003
> 
> --- Comment #18 from sergey.senozhatsky.work@gmail.com ---
> On (03/26/18 09:26), bugzilla-daemon@bugzilla.kernel.org wrote:
> > I wonder if the offload from console_unlock() would actually help. It
> > might fail to detect the critical situation if too many lines are printed
> > inside the lock and console_unlock() is called for each line separately.
> 
> No. We are in preemption disabled context, there is no way for the current
> offloading code to mess up in this situation.

In the version that I see, printing_start_ts is restarted each time
console_unlock() is called. Therefore it would not trigger the offload
if more printk's are called without rescheduling, each calls console_unlock()
for its own line, and each console_unlock() finishes in a reasonable
limit. Or do I miss something?

Of course, it can be solved rather easily.

Best Regards,
Petr
Comment 24 Jiang Biao 2018-03-27 09:12:02 UTC
(In reply to Petr Mladek from comment #14)
> On Mon 2018-03-26 14:12:22, Sergey Senozhatsky wrote:
> > Given how slow serial8250_console_putchar()->wait_for_xmitr() can be -
> > 10ms of delay for every char - it's possible that we had no concurrent
> > printk()-s from other CPUs. So maybe we had just one printing CPU,
> > and several CPUs spinning on a spin_lock which was owned by the printing
> > CPU.
> > 
> > So that's why printk_deferred() helped here. It simply detached the 8250
> > and made the spin_lock critical section as fast as printk->log_store().
> >
> > But here comes the tricky part. Suppose that we:
> > a) have at least two CPUs that call printk concurrently
> > b) have hand off enabled
> > 
> > 
> > Now, what will happen if we have something like this
> > 
> > CPU0                                  CPU1                            CPU2
> >                                       spin_lock(queue_lock)
> >                                        printk                         printk
> > cfqg_print_rwstat_recursive()           serial8250
> >  spin_lock(queue_lock)                  printk                          serial8250
> >                                          serial8250                   printk
> >                                                                        serial8250
> > 
> > 
> > I suspect that handoff may not be very helpful. CPU1 and CPU2 will wait for
> > each other to finish serial8250 and to hand off printing to each other. So CPU1
> > will do 2 serial8250 invocations to printk its messages, and in between it
> > will spin waiting for CPU2 to do its printk->serial8250 and to handoff
> > printing to CPU1. The problem is that CPU1 will be under spin_lock() all
> > that time, so CPU0 is going to suffer just like before.
> > 
> > Opinions?
> 
> It would help if Wen Yang could provide the entire log and also try
> Steven's patches. Otherwise, we are just speculating.
> 
> The 10ms delay-per-char looks scary. But if I get it correctly, it
> would happen only if we hit the deadline for each character. So,
> a question is if the serial port is really that busy and if so why.
> 
> Also I wonder how many messages were actually printed under the
> queue_lock. printk_deferred() seems to help but it is an immediate
> offload. I wonder if the offload from console_unlock() would actually
> help. It might fail to detect the critical situation if too many lines
> are printed inside the lock and console_unlock() is called for each
> line separately. Also it would fail when only a single line caused that
> big delay.
> 
> Best Regards,
> Petr

It's hard to reproduce the issue on a physical environment. We are trying to figure out a way to reproduce it manually by adding printks in the kernel. It'll need some time.

Thank you all for the reply.

Regards,
Jiang
Comment 25 sergey.senozhatsky.work 2018-03-27 10:37:49 UTC
I'll Cc blockdev

On (03/27/18 08:36), bugzilla-daemon@bugzilla.kernel.org wrote:
> > --- Comment #17 from sergey.senozhatsky.work@gmail.com ---
> > On (03/26/18 13:05), bugzilla-daemon@bugzilla.kernel.org wrote:
> > > Therefore the serial console is actually pretty fast. It seems that the
> > > deadline
> > > 10ms-per-character is not in the game here.
> > 
> > As the name suggests this is dmesg - content of logbuf. We can't tell
> > anything about serial consoles speed from it.
> 
> Grrr, you are right. It would be interesting to see the output from
> the serial port as well.
> 
> Anyway, it does not change the fact that printing so many same lines is
> useless. The throttling still would make sense and probably would
> solve the problem.

You are right.

Looking at backtraces (https://bugzilla.kernel.org/attachment.cgi?id=274953&action=edit)
there *probably* was just one CPU doing all printk-s and all printouts. And
there was one CPU waiting for that printing CPU to unlock the queue spin_lock.

The printing CPU was looping in scsi_request_fn(), picking up requests
and calling sdev_printk() for each of them, because the device was
offline. Given that the serial console is not very fast, that we called
it under the queue spin_lock, and the number of printks involved, it
was enough to lock up the CPU that was spinning on the queue spin_lock
and to hard-lockup the system.

scsi_request_fn() does unlock the queue lock later, but not in that
!scsi_device_online(sdev) error case.

scsi_request_fn()
{
	for (;;) {
		int rtn;
		/*
		 * get next queueable request.  We do this early to make sure
		 * that the request is fully prepared even if we cannot
		 * accept it.
		 */
		req = blk_peek_request(q);
		if (!req)
			break;

		if (unlikely(!scsi_device_online(sdev))) {
			sdev_printk(KERN_ERR, sdev,
				    "rejecting I/O to offline device\n");
			scsi_kill_request(req, q);
			continue;
			^^^^^^^^^ still under spinlock
		}
}

I'd probably just unlock/lock the queue lock before `continue', rather
than ratelimit printk-s. Dunno.

James, Martin, what do you think?

	-ss
Comment 26 Wen Yang 2018-03-27 11:02:32 UTC
On the other hand, queue_lock is big.
This code:

void blkcg_print_blkgs(struct seq_file *sf, struct blkcg *blkcg,
		       u64 (*prfill)(struct seq_file *,
				     struct blkg_policy_data *, int),
		       const struct blkcg_policy *pol, int data,
		       bool show_total)
{
	struct blkcg_gq *blkg;
	u64 total = 0;

	rcu_read_lock();
	hlist_for_each_entry_rcu(blkg, &blkcg->blkg_list, blkcg_node) {
		spin_lock_irq(blkg->q->queue_lock);         ---------> acquires q->queue_lock, but only calls blkcg_policy_enabled
		if (blkcg_policy_enabled(blkg->q, pol))
			total += prfill(sf, blkg->pd[pol->plid], data);
		spin_unlock_irq(blkg->q->queue_lock);
	}
	rcu_read_unlock();

	if (show_total)
		seq_printf(sf, "Total %llu\n", (unsigned long long)total);
}

static bool blkcg_policy_enabled(struct request_queue *q,
				 const struct blkcg_policy *pol)
{
	return pol && test_bit(pol->plid, q->blkcg_pols);  
}

But blkcg_policy_enabled() just calls test_bit().
We may use an atomic operation to replace acquiring queue_lock.


Regards,
Wen
Comment 27 Sergey Senozhatsky 2018-03-27 13:29:13 UTC
On (03/27/18 09:06), bugzilla-daemon@bugzilla.kernel.org wrote:
> > --- Comment #18 from sergey.senozhatsky.work@gmail.com ---
> > On (03/26/18 09:26), bugzilla-daemon@bugzilla.kernel.org wrote:
> > > I wonder if the offload from console_unlock() would actually help. It
> > > might fail to detect the critical situation if too many lines are printed
> > > inside the lock and console_unlock() is called for each line separately.
> > 
> > No. We are in preemption disabled context, there is no way for the current
> > offloading code to mess up in this situation.
> 
> In the version that I see, printing_start_ts is restarted each time
> console_unlock() is called. Therefore it would not trigger the offload
> if more printk's are called without rescheduling, each calls console_unlock()
> for its own line, and each console_unlock() finishes in a reasonable
> limit. Or do I miss something?

I think the most recent version I had accounted call_console_drivers()
time on a per-CPU basis, and console_unlock() runs with preemption
disabled. The counter (the amount of time the CPU spent in console
drivers) restarts once a different task executes console_unlock() on
that CPU. Something like that.

	-ss