Bug 15396

Summary: resume after hibernate: /dev/sdb returns as /dev/sdd
Product: Drivers Reporter: Oleksandr Yermolenko (yaa.bta)
Component: Other Assignee: Tejun Heo (tj)
Status: CLOSED CODE_FIX    
Severity: normal CC: rjw, tj, yaa.bta
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.32.10, 2.6.32.7, 2.6.31.12 Subsystem:
Regression: No Bisected commit-id:
Bug Depends on:    
Bug Blocks: 7216    
Attachments: normal bootup and bad resume after hibernate for 2.6.32.7
normal bootup with sequence of 5 successful hibernate/resume cycles for 2.6.32.7
kernel config 2.6.32.7
normal bootup and bad resume after hibernate for 2.6.31.12
kernel config 2.6.31.12
reval-debug.patch
normal bootup and bad resume after hibernate for 2.6.32.7 with reval-debug.patch
reval-debug-1.patch
sd_* calls tracing - the patch
sd_* calls tracing - dmesg output
sd_* calls tracing - screenshot before suspend's poweroff with three unlogged sd_resume() calls
normal bootup and bad resume after hibernate for 2.6.32.7 with reval-debug-1.patch
unlock-hpa-if-device-shrunk.patch
normal bootup with sequence of 3 successful hibernate/resume cycles for 2.6.32.11 with unlock-hpa-if-device-shrunk.patch

Description Oleksandr Yermolenko 2010-02-25 14:58:14 UTC
Created attachment 25217 [details]
normal bootup and bad resume after hibernate for 2.6.32.7

The hard drive (Seagate ST3500320AS with SD1A firmware) sometimes changes its name from /dev/sdb to /dev/sdd after resuming from hibernate. The frequency of the event is low (about one event per 10 hibernate/resume cycles). The other HDDs (SAMSUNG HD321KJ (/dev/sda) and WDC WD15EARS-00Z5B1 (/dev/sdc)) have never changed their /dev/sdX names after resume, as far as I have noticed.

I have experienced this problem on kernels 2.6.31.12 and 2.6.32.7 on Debian squeeze amd64. Kernels are from kernel.org and built with make-kpkg.

Kernel configs and logs are attached.

Note that the numbers in the "n_sectors mismatch" message:
  n_sectors mismatch 976773168 != 976771055
are the same as in the "HPA detected" message:
  HPA detected: current 976771055, native 976773168
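
As a quick check of these numbers, here is a minimal Python sketch (not part of the original logs). It only parses the two messages quoted above; the 512-byte logical sector size is an assumption and is not stated anywhere in this report.

```python
import re

# The two messages quoted above, copied verbatim from the report.
MISMATCH = "n_sectors mismatch 976773168 != 976771055"
HPA = "HPA detected: current 976771055, native 976773168"

SECTOR_BYTES = 512  # assumption: 512-byte logical sectors


def numbers(line):
    """Return all decimal integers of 6 or more digits found in a log line."""
    return [int(n) for n in re.findall(r"\d{6,}", line)]


mismatch_pair = numbers(MISMATCH)
hpa_pair = numbers(HPA)

# Both messages should mention the same two sizes.
same_sizes = sorted(mismatch_pair) == sorted(hpa_pair)

# Sectors hidden behind the HPA, and the same amount in bytes.
hidden_sectors = max(hpa_pair) - min(hpa_pair)
hidden_bytes = hidden_sectors * SECTOR_BYTES

print(same_sizes, hidden_sectors, hidden_bytes)
```

The two sizes differ by 2113 sectors, which is roughly 1 MiB under the sector-size assumption above.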

I'll attach plain dmesg output as soon as I manage to reproduce this bug.

Thanks.
Comment 1 Oleksandr Yermolenko 2010-02-25 15:01:48 UTC
Created attachment 25218 [details]
normal bootup with sequence of 5 successful hibernate/resume cycles for 2.6.32.7
Comment 2 Oleksandr Yermolenko 2010-02-25 15:02:35 UTC
Created attachment 25219 [details]
kernel config 2.6.32.7
Comment 3 Oleksandr Yermolenko 2010-02-25 15:03:20 UTC
Created attachment 25220 [details]
 normal bootup and bad resume after hibernate for 2.6.31.12
Comment 4 Oleksandr Yermolenko 2010-02-25 15:03:56 UTC
Created attachment 25221 [details]
kernel config 2.6.31.12
Comment 5 Tejun Heo 2010-02-26 01:37:23 UTC
Created attachment 25229 [details]
reval-debug.patch

Can you please apply this patch, trigger the bug and then report the kernel log?

Thanks.
Comment 6 Oleksandr Yermolenko 2010-02-26 20:19:22 UTC
Created attachment 25249 [details]
normal bootup and bad resume after hibernate for 2.6.32.7 with reval-debug.patch

In brief:
without the patch we have
  n_sectors mismatch 976773168 != 976771055
with patched kernel we have
  n_sectors mismatch dev->n_sectors=976771055 dev->n_native=976773168 n_sectors=976773168
Comment 7 Tejun Heo 2010-03-02 07:54:43 UTC
Created attachment 25306 [details]
reval-debug-1.patch

Oh... looks like n_native_sectors has changed.  Can you please try this one and report the log?

Thanks.
Comment 8 Oleksandr Yermolenko 2010-03-05 22:43:39 UTC
I am trying to trigger the bug using a kernel with reval-debug-1.patch, but still with no luck.
To speed up the procedure I was playing with different patches and found another (probably related) issue.
When suspending to disk, the sd_resume() routine is also called *before* poweroff. This call is not logged, i.e. after suspend the system stays alive for some time and calls sd_resume() for each disk.

Is this correct behavior?

I have attached the patch for sd.c, the dmesg output after resume, and a screenshot (taken just before suspend's poweroff) showing three sd_suspend() calls followed by three sd_resume() calls. These sd_resume() calls do not appear in the system logs (look at the times in the first column) and appear only on the screen (printed with printk(KERN_EMERG, ... ) ).

P.S.: I'll post the kernel log with reval-debug-1.patch here as soon as I manage to trigger the bug again.
Comment 9 Oleksandr Yermolenko 2010-03-05 22:45:25 UTC
Created attachment 25373 [details]
sd_* calls tracing - the patch
Comment 10 Oleksandr Yermolenko 2010-03-05 22:46:28 UTC
Created attachment 25374 [details]
sd_* calls tracing - dmesg output
Comment 11 Oleksandr Yermolenko 2010-03-05 22:48:53 UTC
Created attachment 25375 [details]
sd_* calls tracing - screenshot before suspend's poweroff with three unlogged sd_resume() calls
Comment 12 Tejun Heo 2010-03-09 06:42:10 UTC
Yeap, that's how the disk image is written.  It first freezes the controller by calling suspend, takes a snapshot in memory, resumes the device and writes the image out, and then shuts the device down again by calling sd_suspend again.  No luck reproducing the bug yet?
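
The ordering described above can be written down as a toy model. This is only an illustration in Python, not kernel code: the function name and the event labels are made up for this sketch, and only sd_suspend() and sd_resume() come from the thread.

```python
def hibernate_call_order(disks):
    """Toy model of the suspend-to-disk sequence described in this comment.

    `disks` is a list of device names. Returns the ordered list of events.
    The event labels are illustrative, not real kernel log messages.
    """
    events = []
    # 1. Freeze the devices before taking the snapshot.
    events += [f"sd_suspend({d})" for d in disks]
    events.append("snapshot memory")
    # 2. Wake the devices again so the image can be written out.
    events += [f"sd_resume({d})" for d in disks]
    events.append("write image")
    # 3. Shut the devices down for the final poweroff.
    events += [f"sd_suspend({d})" for d in disks]
    events.append("poweroff")
    return events


for event in hibernate_call_order(["sda", "sdb", "sdc"]):
    print(event)
```

With three disks the model starts with three sd_suspend() calls followed by three sd_resume() calls, which is the pattern reported from the screenshot in comment 11.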
Comment 13 Oleksandr Yermolenko 2010-03-09 19:52:14 UTC
Created attachment 25434 [details]
normal bootup and bad resume after hibernate for 2.6.32.7 with reval-debug-1.patch

Today I was lucky :)
The bug has been reproduced.

In brief with reval-debug-1.patch we have:
n_sectors mismatch dev->n_sectors=976771055 dev->n_native=976773168 n_sectors=976773168 n_native_sectors=976773168

It seems to me it would be great to write code that runs a sequence of ordinary suspend/resume procedures for a specific disk (even with a hardcoded name). If the bug is not related to the physical poweroff, this may dramatically speed up reproducing it.
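
A rough sketch of such a loop, in Python, might look like the following. All names here (run_cycles, do_cycle, resolve) are hypothetical, and how one suspend/resume cycle is actually triggered is deliberately left to the caller, since that is the open question. The sketch only assumes the disk can be reached through some stable symlink, for example one under /dev/disk/by-id/.

```python
import os


def run_cycles(by_id_link, expected_node, cycles, do_cycle,
               resolve=os.path.realpath):
    """Repeat suspend/resume cycles until the disk changes its /dev node.

    by_id_link    -- stable symlink for the disk (hypothetical path)
    expected_node -- node the disk should keep, e.g. "/dev/sdb"
    do_cycle      -- callable performing one cycle (supplied by the caller)
    resolve       -- maps the symlink to the current node; injectable for tests
    Returns the 1-based number of the first bad cycle, or None if none failed.
    """
    for n in range(1, cycles + 1):
        do_cycle()
        if resolve(by_id_link) != expected_node:
            return n
    return None


# Dry run with stand-ins: the fake disk moves to /dev/sdd on the third cycle.
nodes = iter(["/dev/sdb", "/dev/sdb", "/dev/sdd"])
print(run_cycles("disk-link", "/dev/sdb", 10, do_cycle=lambda: None,
                 resolve=lambda _link: next(nodes)))
```

Keeping the cycle trigger and the name lookup as parameters makes it possible to try the loop without touching real hardware first.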

I should mention that the ST3500320AS drive belongs to the 7200.11 series, which is known for firmware bugs. See http://en.wikipedia.org/wiki/Seagate_Barracuda#7200.11

Thanks.

P.S.: Thanks for the explanation of the suspend-to-disk procedure. I am marking the corresponding attachments as obsolete.
Comment 14 Oleksandr Yermolenko 2010-03-16 10:57:47 UTC
(In reply to comment #7)

> Oh... looks like n_native_sectors has changed. 

Tejun, if you think a firmware bug is highly likely, we could wait until it is reproduced with an HDD of another model.
I find it very suspicious that I have not yet seen the bug with the other two HDDs in my system.

Thanks.
Comment 15 Tejun Heo 2010-03-16 22:55:29 UTC
I suspect firmware or some other weird problem but if you can reproduce the problem, please post the kernel log with the patch applied.  Thanks.
Comment 16 Oleksandr Yermolenko 2010-03-17 00:18:32 UTC
(In reply to comment #15)
> I suspect firmware or some other weird problem but if you can reproduce the
> problem, please post the kernel log with the patch applied.  Thanks.

Already done. Though this kernel log is not inside reply to the appropriate message. I have not figured out how to add attachments inside replies using bugzilla web interface.

Thanks.
Comment 17 Oleksandr Yermolenko 2010-03-28 09:49:22 UTC
(In reply to comment #16)
> (In reply to comment #15)
> > I suspect firmware or some other weird problem but if you can reproduce the
> > problem, please post the kernel log with the patch applied.  Thanks.
> 
> Already done. Though this kernel log is not inside reply to the appropriate
> message. I have not figured out how to add attachments inside replies using
> bugzilla web interface.

I mean that the requested kernel log (with the reval-debug-1.patch applied) is in an ordinary comment (comment #13), not in a reply.

Update: Recently I have experienced this bug twice on 2.6.32.10. Nothing special. The four numbers printed by the debug message are the same as for 2.6.32.7.

Thanks.
Comment 18 Tejun Heo 2010-03-29 03:56:19 UTC
Ah... sorry, I missed that.

Okay, your BIOS is behaving in quite a strange way.  The BIOS is not configuring HPA on boot, but occasionally it configures HPA during resume.  This makes the disk appear clipped on resume, so the kernel can't use the device as-is.  It usually is the other way around: the BIOS sets up HPA during boot but forgets to do it during resume, which the kernel accepts, since a device getting larger usually doesn't pose an actual problem.  I'll give a shot at a workaround for this case too, but for the time being libata.ignore_hpa=1 should fix your problem.

Thanks.
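
To confirm that the libata.ignore_hpa=1 workaround is actually in effect, one option is to look for it on the kernel command line. The helper below is a minimal Python sketch with a made-up function name; it assumes the option was passed as a boot parameter and only parses a string, so the example input is illustrative.

```python
def ignore_hpa_setting(cmdline):
    """Return the value of libata.ignore_hpa on a kernel command line, or None."""
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key == "libata.ignore_hpa" and sep:
            return int(value)
    return None


# Example string; on a running system the text would come from /proc/cmdline.
print(ignore_hpa_setting("root=/dev/sda1 ro libata.ignore_hpa=1 quiet"))
```

A result of None only means the option is absent from the given string, not that a particular default applies.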
Comment 19 Tejun Heo 2010-03-29 04:18:13 UTC
Created attachment 25747 [details]
unlock-hpa-if-device-shrunk.patch

Can you please test this patch without specifying libata.ignore_hpa=1?  With this patch applied, the kernel will report

  ataP.DD: old n_sectors matches native, probably late HPA lock, will try to unlock HPA

and unlock HPA and you should be able to keep the device.

Thanks.
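
The cases described in comments 18 and 19 can be summarised as a small decision function. This is a simplified Python model for illustration, not the libata code: the function name, argument names and return strings are all made up, and the real revalidation path checks more conditions than this.

```python
def revalidate(old_sectors, new_sectors, native_sectors):
    """Simplified model of the size check discussed in comments 18 and 19.

    Returns one of "ok", "accept-grown", "try-unlock-hpa", "fail".
    Names and return values are illustrative only.
    """
    if new_sectors == old_sectors:
        return "ok"
    if new_sectors > old_sectors:
        # The usual case: HPA set at boot but missing after resume.
        return "accept-grown"
    if old_sectors == native_sectors:
        # This bug: full size at boot, clipped by an HPA after resume.
        # Corresponds to "old n_sectors matches native ... will try to unlock HPA".
        return "try-unlock-hpa"
    return "fail"


# Sizes from this report: 976773168 before hibernate, 976771055 after resume.
print(revalidate(976773168, 976771055, 976773168))
```

With the sizes from this report the model takes the third branch, which is the situation the patch is meant to handle.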
Comment 20 Oleksandr Yermolenko 2010-04-04 20:03:35 UTC
Created attachment 25853 [details]
normal bootup with sequence of 3 successful hibernate/resume cycles for 2.6.32.11 with unlock-hpa-if-device-shrunk.patch
Comment 21 Oleksandr Yermolenko 2010-04-04 20:29:25 UTC
(In reply to comment #19)
> unlock-hpa-if-device-shrunk.patch
> 
> Can you please test this patch without specifying libata.ignore_hpa=1?  
>

The dmesg output of a normal bootup followed by 3 successful hibernate/resume cycles is uploaded.

In brief, the messages
 ataX.DD: n_sectors mismatch 976773168 != 976771055
 ataX.DD: old n_sectors matches native, probably late HPA lock, will try to unlock HPA
 ataX.DD: revalidation failed (errno=-5)
are printed only during the first hibernate/resume cycle.
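
For longer test runs, these three messages can be counted in a saved dmesg log with a short script. The following is a minimal Python sketch: the patterns are taken from the lines quoted above, while the function name and the sample input are only for illustration.

```python
import re

# Message fragments taken from the lines quoted above.
PATTERNS = {
    "mismatch": re.compile(r"n_sectors mismatch (\d+) != (\d+)"),
    "unlock": re.compile(r"will try to unlock HPA"),
    "failed": re.compile(r"revalidation failed \(errno=(-?\d+)\)"),
}


def scan(lines):
    """Count occurrences of each message and collect the mismatch sizes."""
    counts = {name: 0 for name in PATTERNS}
    sizes = []
    for line in lines:
        for name, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                counts[name] += 1
                if name == "mismatch":
                    sizes.append((int(match.group(1)), int(match.group(2))))
    return counts, sizes


sample = [
    "ataX.DD: n_sectors mismatch 976773168 != 976771055",
    "ataX.DD: old n_sectors matches native, probably late HPA lock, will try to unlock HPA",
    "ataX.DD: revalidation failed (errno=-5)",
]
counts, sizes = scan(sample)
print(counts, sizes)
```

Because the patterns use search rather than a full-line match, the port prefix and any timestamp in front of the message do not matter.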

I will continue to test this patch and report if I get any more problems.

Thanks for your efforts!
Comment 22 Tejun Heo 2010-04-05 01:15:05 UTC
Cool, will forward patch upstream.  Thanks for testing.
Comment 23 Rafael J. Wysocki 2011-01-16 22:25:21 UTC
Is the problem still present in 2.6.37?
Comment 24 Oleksandr Yermolenko 2011-01-17 09:51:06 UTC
(In reply to comment #23)
> Is the problem still present in 2.6.37?

The most recent kernel version I have tested is 2.6.35.7. There was no problem. Now I am happy with the stock Debian 2.6.32 kernel, since the Debian maintainers have backported unlock-hpa-if-device-shrunk.patch to it. Assuming the patch is still included in 2.6.37, there should be no problem there either, although I have not tested it yet (and probably can't test it easily because of the NVIDIA blob version I am using). But I'll try.
Comment 25 Rafael J. Wysocki 2011-01-17 22:23:40 UTC
OK, thanks for the information.  I'm marking this as fixed, please file
a separate bug if it is broken again in a later kernel.