Bug 31432
Summary: | Resume from suspend fails on new FireWire stack (TSB43AB22/A IEEE-1394a-2000 Controller) | ||
---|---|---|---|
Product: | Drivers | Reporter: | Jaša Bartelj (jasa.bartelj) |
Component: | IEEE1394 | Assignee: | drivers_ieee1394 |
Status: | CLOSED INVALID | ||
Severity: | normal | CC: | rjw, stefanr |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 2.6.38 | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Bug Depends on: | |||
Bug Blocks: | 7216 | ||
Attachments: |
dmesg of 2.6.38
syslog from "firewire-ohci debug=15" before suspend, from Jaša Bartelj |
On Mar 19 bugzilla-daemon@bugzilla.kernel.org wrote: > https://bugzilla.kernel.org/show_bug.cgi?id=31432 > > > Rafael J. Wysocki <rjw@sisk.pl> changed: > > What |Removed |Added > ---------------------------------------------------------------------------- > CC| |rjw@sisk.pl > Component|Hibernation/Suspend |IEEE1394 > Blocks| |7216 > AssignedTo|power-management_other@kern |drivers_ieee1394@kernel-bug > |el-bugs.osdl.org |s.osdl.org > Product|Power Management |Drivers Quoting the bug description for linux1394-devel: > Description From Jaša 2011-03-19 12:34:00 > Created an attachment (id=51242) [details] > dmesg of 2.6.38 [https://bugzilla.kernel.org/attachment.cgi?id=51242] > > I have diagnosed a regression in several kernel versions (at least > 2.6.35.x and current stable version 2.6.38). > > After upgrading from 2.6.32 (Ubuntu 10.04) to 2.6.35 (Ubunutu 10.10) > resume from suspend failed so I came across the /sys/power/pm_trace > tracing method and with that info I narrowed the issue to a firewire > device which seems to be the issue. I also tried the recently released > 2.6.38 mainline version. > > A workaround which confirmed my suspicions is removing firewire modules: > # modprobe -r firewire_core firewire_ohci > Then resume works well. > > Here's the evidence: > > $ uname -a > Linux sendell 2.6.38-020638-generic #201103151303 SMP Tue Mar 15 > 13:08:09 UTC 2011 x86_64 GNU/Linux > > relevant part of dmesg: > [ 1.181998] PM: Hibernation image not present or could not be loaded. > [ 1.182017] registered taskstats version 1 > [ 1.182307] Magic number: 0:117:476 > [ 1.182310] hash matches > /home/kernel-ppa/COD/linux/drivers/base/power/main.c:514 > [ 1.182370] pci 0000:03:05.0: hash matches > [ 1.182442] rtc_cmos 00:05: setting system clock to 2024-03-02 10:28:40 > UTC (1709375320) > > $ lspci | grep 03:05 > 03:05.0 FireWire (IEEE 1394): Texas Instruments TSB43AB22/A IEEE-1394a-2000 > Controller (PHY/Link) > > $ grep 03:05 dmesg-38.txt > [ 0.360052] pci 0000:03:05.0: [104c:8023] type 0 class 0x000c00 > [ 0.360067] pci 0000:03:05.0: reg 10: [mem 0xfdeff000-0xfdeff7ff] > [ 0.360076] pci 0000:03:05.0: reg 14: [mem 0xfdef8000-0xfdefbfff] > [ 0.360123] pci 0000:03:05.0: supports D1 D2 > [ 0.360125] pci 0000:03:05.0: PME# supported from D0 D1 D2 D3hot > [ 0.360130] pci 0000:03:05.0: PME# disabled > > I've posted this on Launchpad (Bug #732641) but no responses at all > there. Alas there are no messages from the firewire drivers in this log. Not even a message from PCI core that a firewire-ohci driver method failed. Hence we (linux1394-devel people) have nothing to go on yet. Can somebody who knows power management explain what should be investigated next? The downstream tracker item is: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/732641 It contains a somewhat different failure description from a 2.6.35 Ubuntu kernel: > Since upgrade to 10.10/kernel 2.6.35 kernel doesn't complete resume from > suspend. > > $ uname -a > Linux sendell 2.6.35-27-generic #48-Ubuntu SMP Tue Feb 22 20:25:46 UTC > 2011 x86_64 GNU/Linux > > Xorg vty is blank, I can change to other vtys but when i return to the > one with Xorg, switching no longer works. Only manual reset remedies > this situation. > > Tried > > https://wiki.ubuntu.com/DebuggingKernelSuspend#%22resume-trace%22%20debugging%20procedure%20for%20finding%20buggy%20drivers > > [ 0.714873] PM: Resume from disk failed. > [ 0.714886] registered taskstats version 1 > [ 0.715167] Magic number: 0:608:476 > [ 0.715170] hash matches > /build/buildd/linux-2.6.35/drivers/base/power/main.c:545 > [ 0.715213] pci 0000:03:05.0: hash matches > [ 0.715281] rtc_cmos 00:05: setting system clock to 1980-09-08 10:27:51 UTC > (337256871) > > $ lspci | grep 03:05 > 03:05.0 FireWire (IEEE 1394): Texas Instruments TSB43AB22/A IEEE-1394a-2000 > Controller (PHY/Link) > > $ dmesg|grep fire > [ 0.380065] pci 0000:03:05.0: reg 10: [mem 0xfdeff000-0xfdeff7ff] > [ 0.380071] pci 0000:03:05.0: reg 14: [mem 0xfdef8000-0xfdefbfff] > [ 0.380105] pci 0000:03:05.0: supports D1 D2 > [ 0.380107] pci 0000:03:05.0: PME# supported from D0 D1 D2 D3hot > [ 0.380111] pci 0000:03:05.0: PME# disabled > [ 0.715213] pci 0000:03:05.0: hash matches > [ 0.893472] firewire_ohci 0000:03:05.0: PCI INT A -> Link[APC4] -> GSI 19 > (level, low) -> IRQ 19 > [ 0.950061] firewire_ohci: Added fw-ohci device 0000:03:05.0, OHCI v1.10, 4 > IR + 8 IT contexts, quirks 0x2 > > $ lsmod|grep fire > firewire_ohci 24839 0 > firewire_core 54327 1 firewire_ohci > crc_itu_t 1739 1 firewire_core > > $sudo modprobe -r firewire_core firewire_ohci > > And then resuming from suspend works. > Jaša, are all of the dmesg messages in this report from the same session? Linux-pm folks, why is the PM Restore failure reported at 0.714873 but the system proceeds and logs successful attachment of firewire-ohci's pci_probe to the PCI device? Also, why is base/power/main.c::device_resume mentioned in the PM trace but there is no hint in the log that firewire-ohci's pci_resume was called? Side note: While firewire-ohci's pci_probe succeeded, the controller might actually not have become operational. Usually the success message from firewire-ohci should be followed on circa 500 ms afterwards by a message like this: "firewire_core: created device fw0: GUID 00110600000041cc, S400" Can somebody comment on the phenomenon that VT switching was affected? ... Well, maybe firewire-core hung in a workqueue job which in turn prevented subsequent processing of other subsystems' workqueue jobs. This can be much more of a problem with a kernel like 2.6.35 than with newer ones with concurrency managed workqueues (2.6.36 and later). (Jaša, prefer reply-to-all for responses. Thanks for your reports so far.) > Alas there are no messages from the firewire drivers in this log. Not even > a message from PCI core that a firewire-ohci driver method failed. Hence > we (linux1394-devel people) have nothing to go on yet. This is because I can't get the message buffer from the hung kernel. The log is from a forcefully rebooted kernel which logs the trace information which is stored in the RTC between reboots (supposedly this is the only non-volatile register available). You can see that overwriting the RTC results in date in the 2020's. More here: https://wiki.ubuntu.com/DebuggingKernelSuspend#%22resume-trace%22%20debugging%20procedure%20for%20finding%20buggy%20drivers > The downstream tracker item is: > https://bugs.launchpad.net/ubuntu/+source/linux/+bug/732641 > > It contains a somewhat different failure description from a 2.6.35 Ubuntu > kernel: > ... > > Jaša, are all of the dmesg messages in this report from the same session? dmesg messages on Launchpad are from 2.6.35 as it is the shipped kernel in Ubuntu 10.10. In the comment on Launchpad you can also see dmesg from 2.6.38-rc8 that was latest mainline at the time. On this kernel bugzilla I posted the dmesg of the just released 2.6.38 stable kernel because I think this version is most relevant at this time (right?). > Linux-pm folks, why is the PM Restore failure reported at 0.714873 but the > system proceeds and logs successful attachment of firewire-ohci's pci_probe > to the PCI device? I think you can find an explanation for this in my first paragraph of this email. > Can somebody comment on the phenomenon that VT switching was affected? ... > Well, maybe firewire-core hung in a workqueue job which in turn prevented > subsequent processing of other subsystems' workqueue jobs. This can be > much more of a problem with a kernel like 2.6.35 than with newer ones > with concurrency managed workqueues (2.6.36 and later). The VT switching is only possible on 2.6.35 for me. That is why I only mentioned it on Launchpad. With .38 my display receives no signal after attempting a resume. > (Jaša, prefer reply-to-all for responses. Thanks for your reports so far.) Thanks for your patience, this is my first bug report here. :) Is the failure in resume from suspend(-to-RAM), or in restore from hibernate (alias resume from suspend-to-disk), or both? > Is the failure in resume from suspend(-to-RAM), or in restore from > hibernate (alias resume from suspend-to-disk), or both? The described failure is from suspend-to-RAM. I have just tried hibernation with firewire modules loaded and the kernel does not complete it. It hangs at a VT with only a blinking cursor on the top left. A hard reset is required to regain control! Here's what I see in the messages log: http://pastebin.com/rz0c8hKM Created attachment 51912 [details]
syslog from "firewire-ohci debug=15" before suspend, from Jaša Bartelj
Via IRC, Jaša reported the attached driver messages from a session with firewire-ohci IRQ event logging switched on by "modprobe firewire-ohci debug=15". There is a never-ending stream of bus reset events at a high frequenncy. (The logging makes the PC entirely unresponsible.)
This means that the physical interface of the FireWire chip is most probably defective. Jaša did not notice any issue of this sort until now because (1.) he did not yet use any FireWire devices on this controller, (2.) without flag 8 of firewire-ohci's debug parameter, firewire-ohci does not enable bus reset IRQs. (Bus reset IRQ events mark the beginning of a bus reset phase; firewire-ohci normally only enables self ID compelete IRQ events which mark the ending of a bus reset phase.)
At the moment, I tend to think this bug should be closed because this is very probably a hardware defect. (And if it is, it is of a kind that cannot sensibly worked around in the driver.)
To keep the mailinglists in the loop: This appears to be a hardware defect. IRQ event logging switched on before suspend reveals an endless storm of bus reset events which hints to a damaged PHY. https://bugzilla.kernel.org/show_bug.cgi?id=31432#c5 Jaša, one question for posterity: Do you have a chance to open the case and have a look what is stamped on the chip? I wonder whether this is a TSB43AB22 or TSB43AB22A. (So far there aren't any differences between these versions known, at least not to me, nor how common or uncommon each of the two versions is. I have a TSB43AB22/A myself but can't check which one because it is hidden in the metal case of a CardBus card.) > one question for posterity: Do you have a chance to open the case and > have a look what is stamped on the chip? I wonder whether this is a > TSB43AB22 or TSB43AB22A. Sure, the markings on the chip are: 6AA0LTW TSB43AB22A _G4_ > At the moment, I tend to think this bug should be closed because this is very > probably a hardware defect. (And if it is, it is of a kind that cannot > sensibly worked around in the driver.) That's unfortunate to read because on the older ieee1394 stack the system didn't fail at resume. Myself, I can just disable the device in BIOS and cease concerning myself about it in general the old behaviour was better. Thank you for your time! On Mar 26 Jaša Bartelj wrote: > the markings on the chip are: > 6AA0LTW > TSB43AB22A > _G4_ OK. So we are not tripping over a potential unknown issue with the older TSB43AB22 (which Texas Instruments still ship but don't recommend to use). [bug closed] > That's unfortunate to read because on the older ieee1394 stack the > system didn't fail at resume. > Myself, I can just disable the device in BIOS and cease concerning > myself about it in general the old behaviour was better. > > Thank you for your time! Bus reset handling and device scanning were fundamentally reworked in the newer kernel drivers compared to the older ones. The old code had races and was ineffective in this department. That the new code happens to regress with an apparent defective chip is indeed unfortunate. But I do not see how this could be addressed. Thanks for your good reports and sorry that it did not lead to a satisfying solution. |
Created attachment 51242 [details] dmesg of 2.6.38 I have diagnosed a regression in several kernel versions (at least 2.6.35.x and current stable version 2.6.38). After upgrading from 2.6.32 (Ubuntu 10.04) to 2.6.35 (Ubunutu 10.10) resume from suspend failed so I came across the /sys/power/pm_trace tracing method and with that info I narrowed the issue to a firewire device which seems to be the issue. I also tried the recently released 2.6.38 mainline version. A workaround which confirmed my suspicions is removing firewire modules: # modprobe -r firewire_core firewire_ohci Then resume works well. Here's the evidence: $ uname -a Linux sendell 2.6.38-020638-generic #201103151303 SMP Tue Mar 15 13:08:09 UTC 2011 x86_64 GNU/Linux relevant part of dmesg: [ 1.181998] PM: Hibernation image not present or could not be loaded. [ 1.182017] registered taskstats version 1 [ 1.182307] Magic number: 0:117:476 [ 1.182310] hash matches /home/kernel-ppa/COD/linux/drivers/base/power/main .c:514 [ 1.182370] pci 0000:03:05.0: hash matches [ 1.182442] rtc_cmos 00:05: setting system clock to 2024-03-02 10:28:40 UTC ( 1709375320) $ lspci | grep 03:05 03:05.0 FireWire (IEEE 1394): Texas Instruments TSB43AB22/A IEEE-1394a-2000 Controller (PHY/Link) $ grep 03:05 dmesg-38.txt [ 0.360052] pci 0000:03:05.0: [104c:8023] type 0 class 0x000c00 [ 0.360067] pci 0000:03:05.0: reg 10: [mem 0xfdeff000-0xfdeff7ff] [ 0.360076] pci 0000:03:05.0: reg 14: [mem 0xfdef8000-0xfdefbfff] [ 0.360123] pci 0000:03:05.0: supports D1 D2 [ 0.360125] pci 0000:03:05.0: PME# supported from D0 D1 D2 D3hot [ 0.360130] pci 0000:03:05.0: PME# disabled I've posted this on Launchpad (Bug #732641) but no responses at all there.