Bug 31432

Summary: Resume from suspend fails on new FireWire stack (TSB43AB22/A IEEE-1394a-2000 Controller)
Product: Drivers Reporter: Jaša Bartelj (jasa.bartelj)
Component: IEEE1394Assignee: drivers_ieee1394
Status: CLOSED INVALID    
Severity: normal CC: rjw, stefanr
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.38 Subsystem:
Regression: Yes Bisected commit-id:
Bug Depends on:    
Bug Blocks: 7216    
Attachments: dmesg of 2.6.38
syslog from "firewire-ohci debug=15" before suspend, from Jaša Bartelj

Description Jaša Bartelj 2011-03-19 12:34:00 UTC
Created attachment 51242 [details]
dmesg of 2.6.38

I have diagnosed a regression in several kernel versions (at least 2.6.35.x and current stable version 2.6.38). 

After upgrading from 2.6.32 (Ubuntu 10.04) to 2.6.35 (Ubunutu 10.10) resume from suspend failed so I came across the /sys/power/pm_trace tracing method and with that info I narrowed the issue to a firewire device which seems to be the issue. I also tried the recently released 2.6.38 mainline version.

A workaround which confirmed my suspicions is removing firewire modules:
# modprobe -r firewire_core firewire_ohci
Then resume works well.

Here's the evidence:

$ uname -a
Linux sendell 2.6.38-020638-generic #201103151303 SMP Tue Mar 15 13:08:09 UTC 2011 x86_64 GNU/Linux

relevant part of dmesg:
[    1.181998] PM: Hibernation image not present or could not be loaded.
[    1.182017] registered taskstats version 1
[    1.182307]   Magic number: 0:117:476
[    1.182310]   hash matches /home/kernel-ppa/COD/linux/drivers/base/power/main
.c:514
[    1.182370] pci 0000:03:05.0: hash matches
[    1.182442] rtc_cmos 00:05: setting system clock to 2024-03-02 10:28:40 UTC (
1709375320)

$ lspci | grep 03:05
03:05.0 FireWire (IEEE 1394): Texas Instruments TSB43AB22/A IEEE-1394a-2000 Controller (PHY/Link)

$ grep 03:05 dmesg-38.txt 
[    0.360052] pci 0000:03:05.0: [104c:8023] type 0 class 0x000c00
[    0.360067] pci 0000:03:05.0: reg 10: [mem 0xfdeff000-0xfdeff7ff]
[    0.360076] pci 0000:03:05.0: reg 14: [mem 0xfdef8000-0xfdefbfff]
[    0.360123] pci 0000:03:05.0: supports D1 D2
[    0.360125] pci 0000:03:05.0: PME# supported from D0 D1 D2 D3hot
[    0.360130] pci 0000:03:05.0: PME# disabled

I've posted this on Launchpad (Bug #732641) but no responses at all there.
Comment 1 Stefan Richter 2011-03-19 14:46:43 UTC
On Mar 19 bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=31432
> 
> 
> Rafael J. Wysocki <rjw@sisk.pl> changed:
> 
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>                  CC|                            |rjw@sisk.pl
>           Component|Hibernation/Suspend         |IEEE1394
>              Blocks|                            |7216
>          AssignedTo|power-management_other@kern |drivers_ieee1394@kernel-bug
>                    |el-bugs.osdl.org            |s.osdl.org
>             Product|Power Management            |Drivers

Quoting the bug description for linux1394-devel:

> Description From  Jaša  2011-03-19 12:34:00  
> Created an attachment (id=51242) [details]
> dmesg of 2.6.38

[https://bugzilla.kernel.org/attachment.cgi?id=51242]

> 
> I have diagnosed a regression in several kernel versions (at least
> 2.6.35.x and current stable version 2.6.38). 
> 
> After upgrading from 2.6.32 (Ubuntu 10.04) to 2.6.35 (Ubunutu 10.10)
> resume from suspend failed so I came across the /sys/power/pm_trace
> tracing method and with that info I narrowed the issue to a firewire
> device which seems to be the issue. I also tried the recently released
> 2.6.38 mainline version.
> 
> A workaround which confirmed my suspicions is removing firewire modules:
> # modprobe -r firewire_core firewire_ohci
> Then resume works well.
> 
> Here's the evidence:
> 
> $ uname -a
> Linux sendell 2.6.38-020638-generic #201103151303 SMP Tue Mar 15
> 13:08:09 UTC 2011 x86_64 GNU/Linux
> 
> relevant part of dmesg:
> [    1.181998] PM: Hibernation image not present or could not be loaded.
> [    1.182017] registered taskstats version 1
> [    1.182307]   Magic number: 0:117:476
> [    1.182310]   hash matches
> /home/kernel-ppa/COD/linux/drivers/base/power/main.c:514
> [    1.182370] pci 0000:03:05.0: hash matches
> [    1.182442] rtc_cmos 00:05: setting system clock to 2024-03-02 10:28:40
> UTC (1709375320)
> 
> $ lspci | grep 03:05
> 03:05.0 FireWire (IEEE 1394): Texas Instruments TSB43AB22/A IEEE-1394a-2000
> Controller (PHY/Link)
> 
> $ grep 03:05 dmesg-38.txt 
> [    0.360052] pci 0000:03:05.0: [104c:8023] type 0 class 0x000c00
> [    0.360067] pci 0000:03:05.0: reg 10: [mem 0xfdeff000-0xfdeff7ff]
> [    0.360076] pci 0000:03:05.0: reg 14: [mem 0xfdef8000-0xfdefbfff]
> [    0.360123] pci 0000:03:05.0: supports D1 D2
> [    0.360125] pci 0000:03:05.0: PME# supported from D0 D1 D2 D3hot
> [    0.360130] pci 0000:03:05.0: PME# disabled
> 
> I've posted this on Launchpad (Bug #732641) but no responses at all
> there.

Alas there are no messages from the firewire drivers in this log.  Not even
a message from PCI core that a firewire-ohci driver method failed.  Hence
we (linux1394-devel people) have nothing to go on yet.

Can somebody who knows power management explain what should be
investigated next?

The downstream tracker item is:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/732641

It contains a somewhat different failure description from a 2.6.35 Ubuntu
kernel:

> Since upgrade to 10.10/kernel 2.6.35 kernel doesn't complete resume from
> suspend.
> 
> $ uname -a
> Linux sendell 2.6.35-27-generic #48-Ubuntu SMP Tue Feb 22 20:25:46 UTC
> 2011 x86_64 GNU/Linux
> 
> Xorg vty is blank, I can change to other vtys but when i return to the
> one with Xorg, switching no longer works. Only manual reset remedies
> this situation.
> 
> Tried
>
> https://wiki.ubuntu.com/DebuggingKernelSuspend#%22resume-trace%22%20debugging%20procedure%20for%20finding%20buggy%20drivers
> 
> [ 0.714873] PM: Resume from disk failed.
> [ 0.714886] registered taskstats version 1
> [ 0.715167] Magic number: 0:608:476
> [ 0.715170] hash matches
> /build/buildd/linux-2.6.35/drivers/base/power/main.c:545
> [ 0.715213] pci 0000:03:05.0: hash matches
> [ 0.715281] rtc_cmos 00:05: setting system clock to 1980-09-08 10:27:51 UTC
> (337256871)
> 
> $ lspci | grep 03:05
> 03:05.0 FireWire (IEEE 1394): Texas Instruments TSB43AB22/A IEEE-1394a-2000
> Controller (PHY/Link)
> 
> $ dmesg|grep fire
> [ 0.380065] pci 0000:03:05.0: reg 10: [mem 0xfdeff000-0xfdeff7ff]
> [ 0.380071] pci 0000:03:05.0: reg 14: [mem 0xfdef8000-0xfdefbfff]
> [ 0.380105] pci 0000:03:05.0: supports D1 D2
> [ 0.380107] pci 0000:03:05.0: PME# supported from D0 D1 D2 D3hot
> [ 0.380111] pci 0000:03:05.0: PME# disabled
> [ 0.715213] pci 0000:03:05.0: hash matches
> [ 0.893472] firewire_ohci 0000:03:05.0: PCI INT A -> Link[APC4] -> GSI 19
> (level, low) -> IRQ 19
> [ 0.950061] firewire_ohci: Added fw-ohci device 0000:03:05.0, OHCI v1.10, 4
> IR + 8 IT contexts, quirks 0x2
> 
> $ lsmod|grep fire
> firewire_ohci 24839 0
> firewire_core 54327 1 firewire_ohci
> crc_itu_t 1739 1 firewire_core
> 
> $sudo modprobe -r firewire_core firewire_ohci
> 
> And then resuming from suspend works.
> 

Jaša, are all of the dmesg messages in this report from the same session?

Linux-pm folks, why is the PM Restore failure reported at 0.714873 but the
system proceeds and logs successful attachment of firewire-ohci's pci_probe
to the PCI device?

Also, why is base/power/main.c::device_resume mentioned in the PM trace
but there is no hint in the log that firewire-ohci's pci_resume was called?

Side note:  While firewire-ohci's pci_probe succeeded, the controller
might actually not have become operational.  Usually the success message
from firewire-ohci should be followed on circa 500 ms afterwards by a
message like this: "firewire_core: created device fw0: GUID
00110600000041cc, S400"

Can somebody comment on the phenomenon that VT switching was affected? ...
Well, maybe firewire-core hung in a workqueue job which in turn prevented
subsequent processing of other subsystems' workqueue jobs.  This can be
much more of a problem with a kernel like 2.6.35 than with newer ones
with concurrency managed workqueues (2.6.36 and later).

(Jaša, prefer reply-to-all for responses. Thanks for your reports so far.)
Comment 2 Jaša Bartelj 2011-03-19 18:00:53 UTC
> Alas there are no messages from the firewire drivers in this log.  Not even
> a message from PCI core that a firewire-ohci driver method failed.  Hence
> we (linux1394-devel people) have nothing to go on yet.

This is because I can't get the message buffer from the hung kernel.
The log is from a forcefully rebooted kernel which logs the trace
information which is stored in the RTC between reboots (supposedly
this is the only non-volatile register available). You can see that
overwriting the RTC results in date in the 2020's. More here:
https://wiki.ubuntu.com/DebuggingKernelSuspend#%22resume-trace%22%20debugging%20procedure%20for%20finding%20buggy%20drivers


> The downstream tracker item is:
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/732641
>
> It contains a somewhat different failure description from a 2.6.35 Ubuntu
> kernel:
>
...
>
> Jaša, are all of the dmesg messages in this report from the same session?

dmesg messages on Launchpad are from 2.6.35 as it is the shipped
kernel in Ubuntu 10.10. In the comment on Launchpad you can also see
dmesg from 2.6.38-rc8 that was latest mainline at the time.
On this kernel bugzilla I posted the dmesg of the just released 2.6.38
stable kernel because I think this version is most relevant at this
time (right?).


> Linux-pm folks, why is the PM Restore failure reported at 0.714873 but the
> system proceeds and logs successful attachment of firewire-ohci's pci_probe
> to the PCI device?

I think you can find an explanation for this in my first paragraph of
this email.


> Can somebody comment on the phenomenon that VT switching was affected? ...
> Well, maybe firewire-core hung in a workqueue job which in turn prevented
> subsequent processing of other subsystems' workqueue jobs.  This can be
> much more of a problem with a kernel like 2.6.35 than with newer ones
> with concurrency managed workqueues (2.6.36 and later).

The VT switching is only possible on 2.6.35 for me. That is why I only
mentioned it on Launchpad. With .38 my display receives no signal
after attempting a resume.


> (Jaša, prefer reply-to-all for responses. Thanks for your reports so far.)

Thanks for your patience, this is my first bug report here. :)
Comment 3 Stefan Richter 2011-03-21 06:59:29 UTC
Is the failure in resume from suspend(-to-RAM), or in restore from
hibernate (alias resume from suspend-to-disk), or both?
Comment 4 Jaša Bartelj 2011-03-21 19:03:13 UTC
> Is the failure in resume from suspend(-to-RAM), or in restore from
> hibernate (alias resume from suspend-to-disk), or both?

The described failure is from suspend-to-RAM.

I have just tried hibernation with firewire modules loaded and the
kernel does not complete it. It hangs at a VT with only a blinking
cursor on the top left. A hard reset is required to regain control!
Here's what I see in the messages log: http://pastebin.com/rz0c8hKM
Comment 5 Stefan Richter 2011-03-24 20:34:49 UTC
Created attachment 51912 [details]
syslog from "firewire-ohci debug=15" before suspend, from Jaša Bartelj

Via IRC, Jaša reported the attached driver messages from a session with firewire-ohci IRQ event logging switched on by "modprobe firewire-ohci debug=15".  There is a never-ending stream of bus reset events at a high frequenncy.  (The logging makes the PC entirely unresponsible.)

This means that the physical interface of the FireWire chip is most probably defective.  Jaša did not notice any issue of this sort until now because (1.) he did not yet use any FireWire devices on this controller, (2.) without flag 8 of firewire-ohci's debug parameter, firewire-ohci does not enable bus reset IRQs.  (Bus reset IRQ events mark the beginning of a bus reset phase; firewire-ohci normally only enables self ID compelete IRQ events which mark the ending of a bus reset phase.)

At the moment, I tend to think this bug should be closed because this is very probably a hardware defect.  (And if it is, it is of a kind that cannot sensibly worked around in the driver.)
Comment 6 Stefan Richter 2011-03-24 20:43:07 UTC
To keep the mailinglists in the loop:

This appears to be a hardware defect.  IRQ event logging switched on before
suspend reveals an endless storm of bus reset events which hints to a
damaged PHY.  https://bugzilla.kernel.org/show_bug.cgi?id=31432#c5
Comment 7 Stefan Richter 2011-03-24 23:34:43 UTC
Jaša,
one question for posterity:  Do you have a chance to open the case and
have a look what is stamped on the chip?  I wonder whether this is a
TSB43AB22 or TSB43AB22A.

(So far there aren't any differences between these versions
known, at least not to me, nor how common or uncommon each of the two
versions is.  I have a TSB43AB22/A myself but can't check which one
because it is hidden in the metal case of a CardBus card.)
Comment 8 Jaša Bartelj 2011-03-26 12:27:22 UTC
> one question for posterity:  Do you have a chance to open the case and
> have a look what is stamped on the chip?  I wonder whether this is a
> TSB43AB22 or TSB43AB22A.

Sure, the markings on the chip are:
6AA0LTW
TSB43AB22A
_G4_

> At the moment, I tend to think this bug should be closed because this is very
> probably a hardware defect.  (And if it is, it is of a kind that cannot
> sensibly worked around in the driver.)

That's unfortunate to read because on the older ieee1394 stack the
system didn't fail at resume.
Myself, I can just disable the device in BIOS and cease concerning
myself about it in general the old behaviour was better.

Thank you for your time!
Comment 9 Stefan Richter 2011-03-26 13:26:40 UTC
On Mar 26 Jaša Bartelj wrote:
> the markings on the chip are:
> 6AA0LTW
> TSB43AB22A
> _G4_

OK.  So we are not tripping over a potential unknown issue with the older
TSB43AB22 (which Texas Instruments still ship but don't recommend to use).

[bug closed]
> That's unfortunate to read because on the older ieee1394 stack the
> system didn't fail at resume.
> Myself, I can just disable the device in BIOS and cease concerning
> myself about it in general the old behaviour was better.
> 
> Thank you for your time!

Bus reset handling and device scanning were fundamentally reworked in the
newer kernel drivers compared to the older ones.  The old code had
races and was ineffective in this department.  That the new code happens to
regress with an apparent defective chip is indeed unfortunate.  But I do
not see how this could be addressed.

Thanks for your good reports and sorry that it did not lead to a
satisfying solution.