Created attachment 25217 [details]
normal bootup and bad resume after hibernate for 22.214.171.124
A hard drive (Seagate ST3500320AS with SD1A firmware) sometimes changes its name from /dev/sdb to /dev/sdd after resuming from hibernate. This happens rarely (about once per 10 hibernate/resume cycles). The other HDDs (SAMSUNG HD321KJ (/dev/sda) and WDC WD15EARS-00Z5B1 (/dev/sdc)) have never changed their /dev/sdX names after resume, as far as I have noticed.
I have experienced this problem on kernels 126.96.36.199 and 188.8.131.52 on Debian squeeze amd64. The kernels are from kernel.org and were built with make-kpkg.
Kernel configs and logs are attached.
Note that the numbers in the "n_sectors mismatch" message:
n_sectors mismatch 976773168 != 976771055
are the same as in the "HPA detected" message:
HPA detected: current 976771055, native 976773168
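A minimal sketch of the comparison above, in Python, for anyone who wants to check other logs the same way. The helper names are hypothetical, and the byte figure assumes 512-byte logical sectors, which the messages themselves do not state.

```python
import re

def parse_mismatch(line):
    """Return the two sector counts from an 'n_sectors mismatch A != B' line."""
    m = re.search(r"n_sectors mismatch (\d+) != (\d+)", line)
    return int(m.group(1)), int(m.group(2))

def parse_hpa(line):
    """Return (current, native) from an 'HPA detected' line."""
    m = re.search(r"HPA detected: current (\d+), native (\d+)", line)
    return int(m.group(1)), int(m.group(2))

# The two messages quoted in this report.
mismatch = parse_mismatch("n_sectors mismatch 976773168 != 976771055")
current, native = parse_hpa("HPA detected: current 976771055, native 976773168")

# Both messages carry the same pair of numbers.
same_pair = set(mismatch) == {current, native}

# Size of the area hidden by the HPA.
hidden_sectors = native - current
hidden_bytes = hidden_sectors * 512  # assumption: 512-byte logical sectors
```

With these two lines the difference is 2113 sectors, i.e. the mismatch is exactly the size of the protected area.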
I'll attach plain dmesg output as soon as I manage to reproduce this bug.
Created attachment 25218 [details]
normal bootup with sequence of 5 successful hibernate/resume cycles for 184.108.40.206
Created attachment 25219 [details]
kernel config 220.127.116.11
Created attachment 25220 [details]
normal bootup and bad resume after hibernate for 18.104.22.168
Created attachment 25221 [details]
kernel config 22.214.171.124
Created attachment 25229 [details]
Can you please apply this patch, trigger the bug and then report the kernel log?
Created attachment 25249 [details]
normal bootup and bad resume after hibernate for 126.96.36.199 with reval-debug.patch
without the patch we have
n_sectors mismatch 976773168 != 976771055
with patched kernel we have
n_sectors mismatch dev->n_sectors=976771055 dev->n_native=976773168 n_sectors=976773168
Created attachment 25306 [details]
Oh... looks like n_native_sectors has changed. Can you please try this one and report the log?
I am trying to trigger the bug using a kernel with reval-debug-1.patch, but so far with no luck.
To speed up the procedure I was playing with different patches and found another (probably related) issue.
When suspending to disk, the sd_resume() routine is also called *before* poweroff. This call is not logged, i.e. after suspend the system stays alive for some time and calls sd_resume() for each disk.
Is this the correct behavior?
I have attached the patch for sd.c, the dmesg output after resume, and a screenshot (taken just before suspend's poweroff) showing three sd_suspend() calls followed by three sd_resume() calls. These sd_resume() calls do not appear in the system logs (look at the times in the first column) and appear only on the screen (printed with printk(KERN_EMERG ...)).
P.S.: I'll post the kernel log with reval-debug-1.patch here as soon as I manage to trigger the bug again.
Created attachment 25373 [details]
sd_* calls tracing - the patch
Created attachment 25374 [details]
sd_* calls tracing - dmesg output
Created attachment 25375 [details]
sd_* calls tracing - screenshot before suspend's poweroff with three unlogged sd_resume() calls
Yep, that's how the disk image is written. It first freezes the controller by calling suspend, takes a snapshot in memory, resumes the device and writes the image out, and then shuts the device down again by calling sd_suspend again. No luck reproducing the bug yet?
Created attachment 25434 [details]
normal bootup and bad resume after hibernate for 188.8.131.52 with reval-debug-1.patch
Today I was lucky :)
The bug has been reproduced.
In brief, with reval-debug-1.patch we have:
n_sectors mismatch dev->n_sectors=976771055 dev->n_native=976773168 n_sectors=976773168 n_native_sectors=976773168
It seems to me it would be great to write code that runs a sequence of the usual suspend/resume procedures for a specific disk (even with a hardcoded name). If the bug is not related to the physical poweroff, this could dramatically speed up reproducing it.
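As a partial sketch of that idea, the Python helpers below only cover the detection side: they record which /dev node each disk has before a cycle and report any disk whose node differs afterwards. They do not trigger suspend/resume themselves. The function names are hypothetical, and the sketch assumes udev provides the usual /dev/disk/by-id symlinks.

```python
import os

def snapshot_names(by_id_dir="/dev/disk/by-id"):
    """Map each stable by-id link name to the /dev node it currently points to."""
    if not os.path.isdir(by_id_dir):
        return {}  # no udev links available on this system
    names = {}
    for entry in sorted(os.listdir(by_id_dir)):
        link = os.path.join(by_id_dir, entry)
        names[entry] = os.path.realpath(link)
    return names

def renamed(before, after):
    """Return {disk_id: (old_node, new_node)} for disks whose node changed.

    new_node is None if the disk is missing from the second snapshot.
    """
    changes = {}
    for disk_id, old_node in before.items():
        new_node = after.get(disk_id)
        if new_node != old_node:
            changes[disk_id] = (old_node, new_node)
    return changes
```

Taking one snapshot before hibernating and one after resume, then passing both to renamed(), would flag a cycle in which the Seagate drive moved from /dev/sdb to /dev/sdd without reading through dmesg by hand.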
I should mention that the ST3500320AS drive belongs to the 7200.11 series, which is notorious for firmware bugs. See http://en.wikipedia.org/wiki/Seagate_Barracuda#7200.11
P.S.: Thanks for the explanation of the suspend-to-disk procedure. Marking the corresponding attachments as obsolete.
(In reply to comment #7)
> Oh... looks like n_native_sectors has changed.
Tejun, if you decide that a firmware bug is highly probable, we could wait until the problem is reproduced with an HDD of another model.
I find it very suspicious that I have not yet experienced the bug with the other two HDDs in my system.
I suspect firmware or some other weird problem but if you can reproduce the problem, please post the kernel log with the patch applied. Thanks.
(In reply to comment #15)
> I suspect firmware or some other weird problem but if you can reproduce the
> problem, please post the kernel log with the patch applied. Thanks.
Already done, though the kernel log is not inside a reply to the appropriate message. I have not figured out how to add attachments inside replies using the bugzilla web interface.
(In reply to comment #16)
> (In reply to comment #15)
> > I suspect firmware or some other weird problem but if you can reproduce the
> > problem, please post the kernel log with the patch applied. Thanks.
> Already done. Though this kernel log is not inside reply to the appropriate
> message. I have not figured out how to add attachments inside replies using
> bugzilla web interface.
I mean that the requested kernel log (with reval-debug-1.patch applied) is inside an ordinary comment (comment #13), not inside a reply.
Update: Recently I have experienced this bug twice on 184.108.40.206. Nothing special. The four numbers printed by the debug message are the same as for 220.127.116.11.
Ah... sorry, I missed that.
Okay, your BIOS is behaving quite strangely. It does not configure HPA on boot, but occasionally it configures HPA during resume. As a result the disk appears clipped on resume, so the kernel can't use the device as-is. It is usually the other way around: the BIOS sets up HPA during boot but forgets to do it during resume, which the kernel accepts because a device getting larger usually doesn't pose an actual problem. I'll try to work around this case too, but for the time being libata.ignore_hpa=1 should fix your problem.
Created attachment 25747 [details]
Can you please test this patch without specifying libata.ignore_hpa=1? With this patch applied, the kernel will report
ataP.DD: old n_sectors matches native, probably late HPA lock, will try to unlock HPA
and unlock HPA and you should be able to keep the device.
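To make the reasoning above easier to follow, here is a simplified Python sketch of the decision as described in this thread. It is not the patch and not the libata code; the function name and the returned labels are hypothetical, chosen only for illustration.

```python
def classify_revalidation(old_n_sectors, new_n_sectors, native_sectors):
    """Classify a capacity change seen when a disk is revalidated after resume.

    old_n_sectors:  capacity the kernel recorded before suspend
    new_n_sectors:  capacity the disk reports after resume
    native_sectors: native (unclipped) capacity of the disk
    """
    if new_n_sectors == old_n_sectors:
        return "unchanged"
    if new_n_sectors > old_n_sectors:
        # The usual case: HPA was set at boot but not on resume.
        # A device getting larger usually poses no actual problem.
        return "grew"
    if old_n_sectors == native_sectors:
        # The case in this report: the disk shrank from its native size,
        # so HPA was probably locked late, during resume.
        return "late-hpa-lock"
    # Shrunk for some other reason: the device can't be used as-is.
    return "mismatch"

# Numbers from this report: old == native, new is the clipped size.
result = classify_revalidation(976773168, 976771055, 976773168)
```

For the numbers in this report the sketch returns "late-hpa-lock", the situation in which the patch tries to unlock HPA and keep the device.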
Created attachment 25853 [details]
normal bootup with sequence of 3 successful hibernate/resume cycles for 18.104.22.168 with unlock-hpa-if-device-shrunk.patch
(In reply to comment #19)
> Can you please test this patch without specifying libata.ignore_hpa=1?
dmesg output of normal bootup with sequence of 3 successful hibernate/resume cycles is uploaded.
In brief, messages
ataX.DD: n_sectors mismatch 976773168 != 976771055
ataX.DD: old n_sectors matches native, probably late HPA lock, will try to unlock HPA
ataX.DD: revalidation failed (errno=-5)
are printed only during the first hibernate/resume cycle.
I will continue to test this patch and report if I get any more problems.
Thanks for your efforts!
Cool, will forward patch upstream. Thanks for testing.
Is the problem still present in 2.6.37?
(In reply to comment #23)
> Is the problem still present in 2.6.37?
The most recent kernel version I have tested is 22.214.171.124, and there was no problem. I am now happy with the stock Debian 2.6.32 kernel, since the Debian maintainers have backported unlock-hpa-if-device-shrunk.patch to it. Assuming the patch is still included in 2.6.37, there should be no problem there either, although I have not tested it yet (and probably can't easily, because of the NVIDIA blob version I am using). But I'll try.
OK, thanks for the information. I'm marking this as fixed, please file
a separate bug if it is broken again in a later kernel.