Bug 199077

Summary: [Possible REGRESSION, 4.16-rc4] Error updating SMART data during runtime and boot failures with blk_mq_terminate_expired in backtrace
Product: IO/Storage Reporter: Martin Steigerwald (Martin)
Component: Block LayerAssignee: Tejun Heo (tj)
Status: RESOLVED CODE_FIX    
Severity: normal CC: bvanassche, jbuchert+kbugs
Priority: P1    
Hardware: Intel   
OS: Linux   
Kernel Version: 4.16-rc4 Subsystem:
Regression: No Bisected commit-id:
Attachments: configuration of the currently running 4.16-rc4 kernel that shows these issues
blk_mq_terminate_expired
4.16.1 quiesce + timeout patch boot failure
4.16.2 backtrace with ming v4 patches
4.16.2 backtrace with ming v4 patches [2/2]

Description Martin Steigerwald 2018-03-11 08:17:02 UTC
Since 4.16-rc4 (upgraded from 4.15.2 which worked) I have an issue
with SMART checks occassionally failing like this:

smartd[28017]: Device: /dev/sdb [SAT], is in SLEEP mode, suspending checks 
udisksd[24408]: Error performing housekeeping for drive /org/freedesktop/UDisks2/drives/INTEL_SSDSA2CW300G3_[…]: Error updating SMART
data: Error sending ATA command CHECK POWER MODE: Unexpected sense data returned:#0120000: 0e 09 0c 00  00 00 ff 00  00 00 00 00  00 00 50 00    ..............P.#0120010: 
00 00 00 00  00 00 00 00  00 00 00 00  00 00 00 00    ................#012 (g-io-error-quark, 0) 
merkaba udisksd[24408]: Error performing housekeeping for drive /org/freedesktop/UDisks2/drives/Crucial_CT480M500SSD3_[…]: Error updating SMART dat
a: Error sending ATA command CHECK POWER MODE: Unexpected sense data returned:#0120000: 01 00 1d 00  00 00 0e 09  0c 00 00 00  ff 00 00 00    ................#0120010: 00 0
0 00 00  50 00 00 00  00 00 00 00  00 00 00 00    ....P...........#012 (g-io-error-quark, 0)

(Intel SSD is connected via SATA, Crucial via mSATA in a ThinkPad T520)

However when I then check manually with smartctl -a | -x | -H the device
reports SMART data just fine.

Also at some boot attempts the boot hangs with a message "could not connect
to lvmetad".

I wonder whether that has to do with the new SATA LPM policy stuff, but as
I had issues with

 3 => Medium power with Device Initiated PM enabled

(machine did not boot, which could also have been caused by me accidentally
removing all TCP/IP network support in the kernel with that setting)

I set it back to

CONFIG_SATA_MOBILE_LPM_POLICY=0

(firmware settings)

Only other significant change I am aware of is that I switched from SLAB
to SLUB allocator as Debian did with their kernels recently I think.

I attach the complete configuration as xz.

Please understand that I am not into doing a bisect as it can take quite a
a while for the issue to appear and I will be holding a Linux training next
week. If you have any other suggestions, please tell.

I found a thread in LKML about another Crucial SSD not working with more
aggressive LPM settings, yet my current 4.16-rc4 kernel runs with LPM policy
0 which should be safe ([PATCH] libata: Apply NOLPM quirk to Crucial MX100 512GB SSDs).

Also about  3 => Medium power with Device Initiated PM enabled I am not yet
sure which of the both SSDs may cause trouble.

Also posted on mailing list.
Comment 1 Martin Steigerwald 2018-03-11 08:21:32 UTC
Created attachment 274651 [details]
configuration of the currently running 4.16-rc4 kernel that shows these issues
Comment 2 Martin Steigerwald 2018-03-11 08:31:39 UTC
[Possible REGRESSION, 4.16-rc4] Error updating SMART data during runtime and could not connect to lvmetad at some boot attempts
https://lkml.org/lkml/2018/3/11/15

Message-ID: <27165802.vQ9JbjrmvU@merkaba>
Comment 3 Martin Steigerwald 2018-03-13 13:01:33 UTC
Created attachment 274703 [details]
blk_mq_terminate_expired

This is a screenshot of a boot failure from tomorrow morning. I had this another time another day but was not in a good situation to take a screenshot.

So this hints at some issue with blk-mq. I do use blk-mq with BFQ.
Comment 4 Martin Steigerwald 2018-04-10 17:28:24 UTC
The second issue is still unfixed as of 4.16 + [PATCH v2] block: Change a rcu_read_{lock,unlock}_sched() pair into rcu_read_{lock,unlock}() vom Bart Van Asche, which Jens Axboe accepted¹.

[1] https://patchwork.kernel.org/patch/10294287/

Currently compiling 4.16.1, but I do not expect a change, as there is nothing about blk-mq subsystem in the changelog as far as I saw.
Comment 5 Martin Steigerwald 2018-04-10 17:29:42 UTC
Forget the comment about the second issue. This bug report is just about what I believe is one issue. I just copied from a mail I will send to the mailing list thread I mentioned in comment 2.
Comment 6 Bart Van Assche 2018-04-10 17:35:54 UTC
This patch may help to get rid of the boot failure: "blk-mq: Directly schedule q->timeout_work when aborting a request" (https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1653932.html).
Comment 7 Martin Steigerwald 2018-04-10 18:45:38 UTC
Thanks, Bart. Will test with a 4.16.1 with [PATCH v2] block: Change a rcu_read_{lock,unlock}_sched() pair into rcu_read_{lock,unlock}() + blk-mq: Directly schedule q->timeout_work when aborting a request on top.

So the "Error updating SMART data" thing I reported here as well may still be a different issue, but I test whether that might have been fixed by it, as well.
Comment 8 Martin Steigerwald 2018-04-10 20:42:36 UTC
Created attachment 275279 [details]
4.16.1 quiesce + timeout patch boot failure

4.16.1 with [PATCH v2] block: Change a rcu_read_{lock,unlock}_sched() pair into rcu_read_{lock,unlock}() + blk-mq: Directly schedule q->timeout_work when aborting a request on top fails as well.
Comment 9 Bart Van Assche 2018-04-10 20:50:20 UTC
The "RIP: scsi_times_out+0x17" crash looks familiar. Please try patch "[PATCH v4] blk-mq: Fix race conditions in request timeout handling" (https://marc.info/?l=linux-block&m=152332423328778). Note: although not everyone agrees about the description of that patch I think there is agreement that that patch itself works reliably.
Comment 10 Martin Steigerwald 2018-04-10 21:06:55 UTC
Thank you, Bart. Will try tomorrow. On top of the two other patches or as a replacement for them?
Comment 11 Bart Van Assche 2018-04-10 21:30:59 UTC
Please apply that patch on top of the other two patches.
Comment 12 Martin Steigerwald 2018-04-11 09:11:23 UTC
Bart, 4.16.1 with all the three patches did about 15 reboots without a hang. So I think it is good to go. As its a race that only happens sometimes I am not yet completely sure, so I am going to test this kernel during the next days with my regular suspend to ram / hibernate to disk and resume workflow. Thank you for your work on this.
Comment 13 Martin Steigerwald 2018-04-16 13:10:10 UTC
Created attachment 275399 [details]
4.16.2 backtrace with ming v4 patches

Ming, a 4.16.2 with the patches:

'[PATCH] blk-mq_Directly schedule q->timeout_work when aborting a request.mbox'
'[PATCH v2] block: Change a rcu_read_{lock,unlock}_sched() pair into rcu_read_{lock,unlock}().mbox'
'[PATCH V4 1_2] blk-mq_set RQF_MQ_TIMEOUT_EXPIRED when the rq'\''s timeout isn'\''t handled.mbox'
'[PATCH V4 2_2] blk-mq_fix race between complete and BLK_EH_RESET_TIMER.mbox'

hung on boot 3 out of 4 times.

I attach backtraces from two different attempts.
Comment 14 Martin Steigerwald 2018-04-16 13:10:34 UTC
Created attachment 275401 [details]
4.16.2 backtrace with ming v4 patches [2/2]
Comment 15 Martin Steigerwald 2018-05-11 08:57:35 UTC
This is fixed meanwhile by f4560231ec42092c6662acccabb28c6cac9f5dfb which went into 4.17 and is marked to be backported/applied to various backport kernels.

commit f4560231ec42092c6662acccabb28c6cac9f5dfb
Author: Jianchao Wang <[…]>
Date:   Tue Apr 17 11:46:20 2018 +0800

    blk-mq: start request gstate with gen 1
    
    rq->gstate and rq->aborted_gstate both are zero before rqs are
    allocated. If we have a small timeout, when the timer fires,
    there could be rqs that are never allocated, and also there could
    be rq that has been allocated but not initialized and started. At
    the moment, the rq->gstate and rq->aborted_gstate both are 0, thus
    the blk_mq_terminate_expired will identify the rq is timed out and
    invoke .timeout early.
    
    For scsi, this will cause scsi_times_out to be invoked before the
    scsi_cmnd is not initialized, scsi_cmnd->device is still NULL at
    the moment, then we will get crash.