Bug 7072 - ieee1394 doesn't work after resume
Status: CLOSED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: IEEE1394
Hardware: i386 Linux
Importance: P2 high
Assignee: Stefan Richter
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2006-08-29 10:20 UTC by Ritesh Raj Sarraf
Modified: 2007-04-28 12:49 UTC

See Also:
Kernel Version: 2.6.17.11
Subsystem:
Regression: ---
Bisected commit-id:


Attachments
dmesg output (15.12 KB, text/plain)
2006-11-04 06:23 UTC, Ritesh Raj Sarraf
lspci output (1.83 KB, text/plain)
2006-11-04 06:24 UTC, Ritesh Raj Sarraf
patch to let external nodes rediscover the resuming node (3.56 KB, patch)
2007-01-07 13:00 UTC, Stefan Richter

Description Ritesh Raj Sarraf 2006-08-29 10:20:27 UTC
Most recent kernel where this bug did not occur: N/A
Distribution: Debian testing/unstable

Hardware Environment:
Intel Pentium M 1.5 GHz
RAM: 768MB

Software Environment:
Debian testing/unstable
Vanilla Kernel 2.6.17.11 + Software Suspend 2

Problem Description:
On resume from a hibernated state, ieee1394 networking doesn't work. The
ifconfig output looks correct, but no host on that network can be pinged.
Reloading the modules doesn't help either.


Steps to reproduce:
1) Install Linux with Software Suspend 2.
2) Hibernate.
3) On resume, the ieee1394 network doesn't work.


Here are the logs:
While doing a ping, the kernel logged the following messages.
ohci1394: fw-host0: Error in reception of SelfID packets [0x00030014/0x000932d1] (count: 0)
ohci1394: fw-host0: Error in reception of SelfID packets [0x00040014/0x000932d1] (count: 1)
ohci1394: fw-host0: Error in reception of SelfID packets [0x00050014/0x000932d1] (count: 2)
ohci1394: fw-host0: Error in reception of SelfID packets [0x00060014/0x000932d1] (count: 3)
ohci1394: fw-host0: Error in reception of SelfID packets [0x00070014/0x000932d1] (count: 4)
ohci1394: fw-host0: Error in reception of SelfID packets [0x00080014/0x000932d1] (count: 5)
ieee1394: Current remote IRM is not 1394a-2000 compliant, resetting...
ohci1394: fw-host0: Error in reception of SelfID packets [0x000a0014/0x000932d1] (count: 0)
ohci1394: fw-host0: Error in reception of SelfID packets [0x000b0014/0x000932d1] (count: 1)
ohci1394: fw-host0: Error in reception of SelfID packets [0x000c0014/0x000932d1] (count: 2)
ohci1394: fw-host0: Error in reception of SelfID packets [0x000d0014/0x000932d1] (count: 3)
ohci1394: fw-host0: Error in reception of SelfID packets [0x000e0014/0x000932d1] (count: 4)
ohci1394: fw-host0: Error in reception of SelfID packets [0x000f0014/0x000932d1] (count: 5)
ohci1394: fw-host0: Error in reception of SelfID packets [0x00100014/0x000932d1] (count: 6)
ohci1394: fw-host0: Error in reception of SelfID packets [0x00110014/0x000932d1] (count: 7)
ohci1394: fw-host0: Error in reception of SelfID packets [0x00120014/0x000932d1] (count: 8)
ohci1394: fw-host0: Error in reception of SelfID packets [0x00130014/0x000932d1] (count: 9)
ohci1394: fw-host0: Error in reception of SelfID packets [0x00140014/0x000932d1] (count: 10)
ohci1394: fw-host0: Error in reception of SelfID packets [0x00150014/0x000932d1] (count: 11)
ohci1394: fw-host0: Error in reception of SelfID packets [0x00160014/0x000932d1] (count: 12)
ohci1394: fw-host0: Error in reception of SelfID packets [0x00170014/0x000932d1] (count: 13)
ohci1394: fw-host0: Error in reception of SelfID packets [0x00180014/0x000932d1] (count: 14)
ohci1394: fw-host0: Error in reception of SelfID packets [0x00190014/0x000932d1] (count: 15)
ohci1394: fw-host0: Error in reception of SelfID packets [0x001a0014/0x000932d1] (count: 16)
ohci1394: fw-host0: Too many errors on SelfID error reception, giving up!
ieee1394: impossible ack_complete from node 65535 (tcode 4)
ieee1394: Current remote IRM is not 1394a-2000 compliant, resetting...
ohci1394: fw-host0: Error in reception of SelfID packets [0x001b0014/0x000932d1] (count: 16)
ohci1394: fw-host0: Too many errors on SelfID error reception, giving up!
ieee1394: impossible ack_complete from node 65535 (tcode 4)
ieee1394: Current remote IRM is not 1394a-2000 compliant, resetting...
ohci1394: fw-host0: Error in reception of SelfID packets [0x001c0014/0x000932d1] (count: 16)
ohci1394: fw-host0: Too many errors on SelfID error reception, giving up!
ieee1394: impossible ack_complete from node 65535 (tcode 4)
ieee1394: Current remote IRM is not 1394a-2000 compliant, resetting...
ohci1394: fw-host0: Error in reception of SelfID packets [0x001d0014/0x000932d1] (count: 16)
ohci1394: fw-host0: Too many errors on SelfID error reception, giving up!
ieee1394: impossible ack_complete from node 65535 (tcode 4)
ieee1394: Current remote IRM is not 1394a-2000 compliant, resetting...
ohci1394: fw-host0: Error in reception of SelfID packets [0x001e0014/0x000932d1] (count: 16)
ohci1394: fw-host0: Too many errors on SelfID error reception, giving up!
ieee1394: impossible ack_complete from node 65535 (tcode 4)
ieee1394: Current remote IRM is not 1394a-2000 compliant, resetting...
ieee1394: Stopping reset loop for IRM sanity


During module reload, the following messages were observed:
ohci1394: fw-host0: Error in reception of SelfID packets [0x001f0014/0x000932d1] (count: 16)
ohci1394: fw-host0: Too many errors on SelfID error reception, giving up!
ieee1394: impossible ack_complete from node 65535 (tcode 4)
ieee1394: Current remote IRM is not 1394a-2000 compliant, resetting...
ohci1394: fw-host0: Error in reception of SelfID packets [0x00200014/0x000932d1] (count: 16)
ohci1394: fw-host0: Too many errors on SelfID error reception, giving up!
ieee1394: impossible ack_complete from node 65535 (tcode 4)
ieee1394: Current remote IRM is not 1394a-2000 compliant, resetting...
ohci1394: fw-host0: Error in reception of SelfID packets [0x00210014/0x000932d1] (count: 16)
ohci1394: fw-host0: Too many errors on SelfID error reception, giving up!
ieee1394: impossible ack_complete from node 65535 (tcode 4)
ieee1394: Current remote IRM is not 1394a-2000 compliant, resetting...
ohci1394: fw-host0: Error in reception of SelfID packets [0x00220014/0x000932d1] (count: 16)
ohci1394: fw-host0: Too many errors on SelfID error reception, giving up!
ieee1394: impossible ack_complete from node 65535 (tcode 4)
ieee1394: Current remote IRM is not 1394a-2000 compliant, resetting...
ohci1394: fw-host0: Error in reception of SelfID packets [0x00230014/0x000932d1] (count: 16)
ohci1394: fw-host0: Too many errors on SelfID error reception, giving up!
ieee1394: impossible ack_complete from node 65535 (tcode 4)
ieee1394: Current remote IRM is not 1394a-2000 compliant, resetting...
ohci1394: fw-host0: Error in reception of SelfID packets [0x00240014/0x000932d1] (count: 16)
ohci1394: fw-host0: Too many errors on SelfID error reception, giving up!
ieee1394: impossible ack_complete from node 65535 (tcode 4)
ieee1394: Current remote IRM is not 1394a-2000 compliant, resetting...
ieee1394: Stopping reset loop for IRM sanity
ieee1394: Node removed: ID:BUS[0-00:1023]  GUID[00c09f00001f8a88]
ieee1394: Node removed: ID:BUS[0-00:1023]  GUID[354fc0002226c838]
ieee1394: Unknown parameter `sbp2'
ieee1394: Initialized config rom entry `ip1394'
ACPI: PCI Interrupt 0000:02:07.0[A] -> Link [LNKF] -> GSI 11 (level, low) -> IRQ 11
ohci1394: fw-host0: OHCI-1394 1.1 (PCI): IRQ=[11]  MMIO=[e0205000-e02057ff]  Max Packet=[2048]  IR/IT contexts=[4/8]
ieee1394: Host added: ID:BUS[0-00:1023]  GUID[00c09f00001f8a88]
ieee1394: Node added: ID:BUS[0-01:1023]  GUID[354fc0002226c838]
eth1394: eth2: IEEE-1394 IPv4 over 1394 Ethernet (fw-host0)
ieee1394: sbp2: Driver forced to serialize I/O (serialize_io=1)
ieee1394: sbp2: Try serialize_io=0 for better performance
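
Editorial sketch, not part of the original report: one way to read the
bracketed values in the "Error in reception of SelfID packets" lines above is
to assume that the first number is the controller's SelfIDCount register and
the second is the first quadlet of the self-ID receive buffer, each carrying a
generation counter in bits 23:16 as laid out in the OHCI 1.1 specification.
That mapping is an assumption about what the driver prints. Under it, the
register generation advances with every bus reset while the buffer stays at
generation 9, and generation 9 is the only one missing from the error lines.
The helper names below are hypothetical.

```python
import re

# Assumption: the pair is [SelfIDCount register / first self-ID buffer quadlet].
# \s* lets the pattern match both joined and hard-wrapped log lines.
LINE_RE = re.compile(
    r"SelfID packets\s*\[0x([0-9a-fA-F]{8})/0x([0-9a-fA-F]{8})\]"
)


def decode_selfid_error(line):
    """Return the generation fields of one error line, or None if no match."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    register, first_quadlet = (int(group, 16) for group in match.groups())
    return {
        # Bit 31 of the register is taken to be the self-ID error flag.
        "error_flag": bool(register >> 31),
        # Bits 23:16 are taken to be the generation counter in both values.
        "register_generation": (register >> 16) & 0xFF,
        "buffer_generation": (first_quadlet >> 16) & 0xFF,
    }


def generations_match(decoded):
    """True when register and buffer report the same generation."""
    return decoded["register_generation"] == decoded["buffer_generation"]
```

Feeding the first logged line through `decode_selfid_error` gives register
generation 3 against buffer generation 9, i.e. a mismatch, which would fit a
buffer that is no longer being updated after resume. The log itself does not
state this; it is only an interpretation.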
Comment 1 Stefan Richter 2006-08-29 12:34:46 UTC
Apart from some platform code for PPC, ohci1394's suspend and resume hooks do
not save and restore high-level configuration. (Only save and restore of PCI
status has been added lately.) See OHCI 1.1 appendix A.4.2 [1]. Anything
above ohci1394, i.e. ieee1394, eth1394 and so on, has not been checked for
possible suspend/resume bugs.

So far ohci1394 needs to be unloaded before suspend. I don't know if ohci1394
could also be unloaded after resume and then be successfully reloaded. Did you
really unload and reload ohci1394 to produce the second half of your log?

This bug should be filed under Category Drivers, Component IEEE 1394.

[1] http://developer.intel.com/technology/1394/download/ohci_11.htm
Comment 2 Ritesh Raj Sarraf 2006-09-01 11:31:52 UTC
Today I got a kernel oops too.

Sep  1 23:53:11 localhost kernel: ieee1394: Node changed: 0-01:1023 -> 
0-00:1023
Sep  1 23:53:11 localhost kernel: ieee1394: Node suspended: ID:BUS[0-00:1023]  
GUID[354fc0002226c838]
Sep  1 23:53:15 localhost kernel: ieee1394: Node changed: 0-00:1023 -> 
0-01:1023
Sep  1 23:53:33 localhost kernel: ieee1394: Node resumed: ID:BUS[0-00:1023]  
GUID[354fc0002226c838]
Sep  1 23:58:23 localhost kernel: ieee1394: Node changed: 0-01:1023 -> 
0-00:1023
Sep  1 23:58:23 localhost kernel: ieee1394: Node suspended: ID:BUS[0-00:1023]  
GUID[354fc0002226c838]
Sep  1 23:58:29 localhost kernel: ieee1394: The root node is not cycle master 
capable; selecting a new root node and resetting...
Sep  1 23:58:29 localhost kernel: ieee1394: Node changed: 0-00:1023 -> 
0-01:1023
Sep  1 23:58:44 localhost kernel: ieee1394: Error parsing configrom for node 
0-00:1023
Sep  2 00:00:01 localhost CRON[18992]: (pam_unix) session opened for user 
logcheck by (uid=0)
Sep  2 00:00:01 localhost /USR/SBIN/CRON[18993]: (logcheck) CMD (   if 
[ -x /usr/sbin/logcheck ]; then nice -n10 /usr/sbin/logcheck; fi)
Sep  2 00:00:16 localhost CRON[18992]: (pam_unix) session closed for user 
logcheck
Sep  2 00:00:21 localhost kernel: ieee1394: Node changed: 0-01:1023 -> 
0-00:1023
Sep  2 00:00:25 localhost kernel: ieee1394: Node changed: 0-00:1023 -> 
0-01:1023
Sep  2 00:00:41 localhost kernel: BUG: unable to handle kernel paging request 
at virtual address ef014000
Sep  2 00:00:41 localhost kernel:  printing eip:
Sep  2 00:00:41 localhost kernel: eea78690
Sep  2 00:00:41 localhost kernel: *pde = 28367067
Sep  2 00:00:41 localhost kernel: *pte = 00000000
Sep  2 00:00:41 localhost kernel: Oops: 0000 [#1]
Sep  2 00:00:41 localhost kernel: PREEMPT
Sep  2 00:00:41 localhost kernel: Modules linked in: appletalk ax25 ipx p8023 
i915 drm kqemu vmnet parport_pc parport vmmon binfmt_misc button ac battery 
autofs4 ipv6 pcmcia tun ipt_MASQUERADE iptable_nat ip_nat act_police 
sch_ingress cls_u32 sch_sfq sch_cbq xt_state ip_conntrack nfnetlink 
iptable_filter ip_tables x_tables fuse pcspkr cpufreq_stats cpufreq_userspace 
cpufreq_powersave cpufreq_conservative cpufreq_ondemand cn video 
speedstep_centrino freq_table sr_mod sbp2 tda9887 tuner saa7115 em28xx joydev 
compat_ioctl32 v4l1_compat v4l2_common ir_common videodev tveeprom mousedev 
tsdev evdev eth1394 psmouse serio_raw ipw2100 ieee80211 ieee80211_crypt 
snd_intel8x0 snd_intel8x0m rtc snd_ac97_codec snd_ac97_bus firmware_class 
snd_pcm_oss snd_mixer_oss snd_pcm yenta_socket rsrc_nonstatic pcmcia_core 
snd_timer i2c_i801 snd soundcore snd_page_alloc i2c_core hw_random shpchp 
pci_hotplug intel_agp agpgart sd_mod dm_mirror dm_snapshot ide_cd cdrom 
ide_disk 8139too usbhid usb_storage scsi_mod piix ohci1394 ieee1394 8139cp
Sep  2 00:00:41 localhost kernel: ii generic ehci_hcd uhci_hcd usbcore thermal 
processor fan
Sep  2 00:00:41 localhost kernel: CPU:    0
Sep  2 00:00:41 localhost kernel: EIP:    0060:[<eea78690>]    Tainted: P      
VLI
Sep  2 00:00:41 localhost kernel: EFLAGS: 00010293   (2.6.17-my-patches #4)
Sep  2 00:00:41 localhost kernel: EIP is at csr1212_parse_keyval+0x57/0x1ee 
[ieee1394]
Sep  2 00:00:41 localhost kernel: eax: ef01303c   ebx: 00000000   ecx: 
000003f0   edx: 00000020
Sep  2 00:00:41 localhost kernel: esi: fffffff4   edi: 00000020   ebp: 
ef015000   esp: ed7cff1c
Sep  2 00:00:41 localhost kernel: ds: 007b   es: 007b   ss: 0068
Sep  2 00:00:41 localhost kernel: Process knodemgrd_0 (pid: 1369, 
threadinfo=ed7ce000 task=dff49a50)
Sep  2 00:00:41 localhost kernel: Stack: ef01303c 000003f0 0000ffff 03fcdff4 
00000000 ef015000 ef013000 ef017000
Sep  2 00:00:41 localhost kernel:        eea78bf6 f0000414 0000ffff 00000014 
00000004 fffffffc ffffffff eea859ec
Sep  2 00:00:41 localhost kernel:        ef015000 ef011000 fffffffc ef01303c 
ef013000 00000014 00000000 ef01303c
Sep  2 00:00:41 localhost kernel: Call Trace:
Sep  2 00:00:41 localhost kernel:  <eea78bf6> _csr1212_read_keyval+0x3cf/0x40b 
[ieee1394]  <eea78daa> csr1212_parse_csr+0x178/0x1b8 [ieee1394]
Sep  2 00:00:41 localhost kernel:  <eea76a8d> nodemgr_host_thread+0x354/0x8a7 
[ieee1394]  <eea76739> nodemgr_host_thread+0x0/0x8a7 [ieee1394]
Sep  2 00:00:41 localhost kernel:  <c0101005> kernel_thread_helper+0x5/0xb
Sep  2 00:00:41 localhost kernel: Code: 24 08 8a 45 00 3c 02 0f 84 53 01 00 00 
3c 03 0f 85 8a 01 00 00 31 f6 c7 44 24 04 00 00 00 00 e9 29 01 00 00 8b 4c 24 
04 8b 04 24 <8b> 54 88 04 85 d2 0f 84 12 01 00 00 89 d1 8b 5d 20 0f c9 89 c8
Sep  2 00:00:41 localhost kernel: EIP: [<eea78690>] 
csr1212_parse_keyval+0x57/0x1ee [ieee1394] SS:ESP 0068:ed7cff1c
Sep  2 00:02:40 localhost kernel:  <6>eth2: Reseting on mode change.
Sep  2 00:02:41 localhost kernel: ADDRCONF(NETDEV_UP): eth2: link is not ready
Sep  2 00:02:41 localhost avahi-daemon[6994]: New relevant interface eth2.IPv4 
for mDNS.
Sep  2 00:02:41 localhost avahi-daemon[6994]: Joining mDNS multicast group on 
interface eth2.IPv4 with address 192.168.1.1.
Sep  2 00:02:41 localhost avahi-daemon[6994]: Registering new address record 
for 192.168.1.1 on eth2.
Sep  2 00:02:41 localhost kernel: ADDRCONF(NETDEV_CHANGE): eth2: link becomes 
ready
Sep  2 00:02:52 localhost kernel: eth2: no IPv6 routers present
Sep  2 00:06:57 localhost kernel: ieee1394: ether1394 rx: sender nodeid lookup 
failure: 0-00:1023
Sep  2 00:07:28 localhost last message repeated 21 times
Comment 3 Ritesh Raj Sarraf 2006-09-01 12:09:01 UTC
I'm not sure if there are firewire hubs available in the market or not.

But I believe the severity for this bug needs to be raised.

Machine-A and Machine-B are connected to each other via firewire.
Machine-A is a workstation which remains online almost all the time. Machine-B
is a laptop which is frequently suspended/resumed. The bad part about this bug
is that when both machines are connected and you do a resume on the laptop,
the kernel firewire information on both machines is screwed up.
Then if you connect a third laptop to Machine-A via firewire, and do a normal 
boot-up on it, it won't be able to use the firewire network with Machine-A.

The kernel oops message in the previous comment that I've sent you is from my 
Machine-A which I use almost like a workstation.

Hence, I feel the severity of this bug needs to be raised, because it affects
not just the resuming machine but also other machines on the network that use
firewire.

If you're convinced, please raise the severity.

Thanks,
Ritesh
Comment 4 Stefan Richter 2006-09-01 13:12:53 UTC
I think you just found at least one other independent bug ("unable to handle
paging request" in csr1212_parse_keyval). I filed it as bug 7098.

After Machine-B suspended and resumed, will Machine-A and Machine-C be unable to
communicate every time or only after such an oops? (Assuming the oops does not
happen every time...)

With FireWire, every node which has more than one port is a hub. Often the PHY
(physical interface chip) is separate from the LLC (link layer controller or
just "link", e.g. the OHCI FireWire-to-PCI bus bridge). But sometimes they are
integrated and/or the PHY is enabled or disabled via the LLC. So if you
daisy-chained A--B--C and suspended (and possibly resumed) B, B's PHY might
still be disabled. With C--A--B, communication between A and C should be
possible unless B floods the bus with resets or other bogus signals.
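
Editorial sketch, not part of the original comment: the two cabling cases
above can be modelled as a small graph search in which every node repeats
traffic between its ports unless its PHY is disabled. The function name and
the model itself are illustrative assumptions; real bus behaviour (resets,
bogus signals) is not modelled.

```python
from collections import deque


def reachable(links, disabled_phys, src, dst):
    """Breadth-first search over cable links.

    links: iterable of (node, node) pairs, one per cable.
    disabled_phys: set of nodes whose PHY is off and therefore repeats nothing.
    """
    if src in disabled_phys or dst in disabled_phys:
        return False
    neighbours = {}
    for a, b in links:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    seen, queue = {src}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for nxt in neighbours.get(node, ()):
            # A node with a disabled PHY cannot forward to its other port.
            if nxt not in seen and nxt not in disabled_phys:
                seen.add(nxt)
                queue.append(nxt)
    return False


# A--B--C with B's PHY disabled: A and C are cut off from each other.
chain_through_b = reachable([("A", "B"), ("B", "C")], {"B"}, "A", "C")
# C--A--B with B's PHY disabled: A and C can still talk.
chain_through_a = reachable([("C", "A"), ("A", "B")], {"B"}, "A", "C")
```

In this model `chain_through_b` is False and `chain_through_a` is True, which
matches the reasoning in the comment for a quiet, disabled B.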
Comment 5 Ritesh Raj Sarraf 2006-09-01 13:37:35 UTC
This is how I came to the conclusion that this bug will affect other machines 
on the network.

I have two laptops with me which currently I network using firewire.
The old laptop (Machine-A) is used almost as a workstation as it remains up 
for longer durations (sometimes days).
My new laptop (Machine-B) is what I carry with me. I use suspend/resume on it.

So here's how it goes:
* I boot up both Machine-A and Machine-B, and both talk to each other.
* I then hibernate Machine-B and go out for a coffee. At this time Machine-A
is on.
* I return and resume Machine-B. At this moment the firewire network gets
corrupted. Both machines' kernels log lots of messages to syslog, and
sometimes they oops too. At this point, unloading and reloading the modules
doesn't help either.
* Then I reboot Machine-B (including unplugging/replugging the cable). This
should be equivalent to attaching another machine to the network.
* Still no help. Once ifup'ed, both machines' kernels again start logging
lots of messages.
* The only solution at that moment is to reboot both machines.

Comment 6 Stefan Richter 2006-09-01 14:02:16 UTC
Then this might be a third bug. Here is a hypothesis: All FireWire nodes with
LLC are identifiable by a persistent GUID alias EUI-64. Once Machine-B was
rebooted, it will be back under its old GUID as it is supposed to. Machine-A's
eth1394 interface to B's GUID however might still be inoperable due to the mess
that B's resume created before.

I suppose the currently necessary workaround is to unload ohci1394 on a machine
that is to be suspended.

Alas I know little about eth1394 and am quite busy right now, so I hesitate to
assign this bug to myself. More diagnosis from you, especially regarding
eth1394, would be appreciated though. Thanks.
Comment 7 Stefan Richter 2006-09-01 22:35:53 UTC
PS: In case you also come to the conclusion that there may be a bug in eth1394
(concerning Machine-A's inability to get going again), please open another bug
with the respective logs. That way we can concentrate bug 7072 on ohci1394 alone.
Comment 8 Ritesh Raj Sarraf 2006-09-02 09:10:00 UTC
Currently, on Machine-B, at hibernation, I've configured the firewire 
interface to be put down and ohci1394 module to be removed. But that still 
doesn't help.

Upon resume, when I ifup the firewire interface, it is unusable.

Doing a ping results in the following kernel messages:

ohci1394: fw-host0: Unrecoverable error!
ohci1394: fw-host0: Async Req Tx Context died: ctrl[000088066] cmdptr[f7073048]
Comment 9 Stefan Richter 2006-09-02 09:40:29 UTC
Does this message occur on B or on A?
What if you unload all IEEE 1394 drivers on B before suspend?
Comment 10 Bernhard Kaindl 2006-09-06 08:29:57 UTC
Support for hibernate/suspend2disk is not implemented yet in ohci1394;
therefore, you currently have to unload this module before hibernating. This
can of course be a big problem if you find yourself in a situation where it is
difficult or even impossible to do this, e.g. if bugs in some other modules
that depend on ohci1394 prevent it from being unloaded, but this is only one
of many possible reasons. I have now taken some initiative on this and had a
first success, but as of this moment, the implementation is not yet ready to
be merged. Read the thread at:

http://sourceforge.net/mailarchive/forum.php?thread_id=30474986&forum_id=5389
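
Editorial sketch, not part of the original comment: the workaround of
unloading the IEEE 1394 drivers before hibernating could be scripted roughly
as below. The module list is limited to the names that appear in this report,
the unload order (dependent drivers first, the ieee1394 core last) is an
assumption, and how such a script would be hooked into the suspend tooling is
left open. It needs root privileges when not run as a dry run.

```python
import subprocess

# Module names taken from this report; the order is an assumption.
IEEE1394_MODULES = ["eth1394", "sbp2", "ohci1394", "ieee1394"]


def unload_commands(modules=IEEE1394_MODULES):
    """Build one `modprobe -r` command line per module, in unload order."""
    return [["modprobe", "-r", module] for module in modules]


def unload_all(dry_run=True):
    """Print the commands, or execute them when dry_run is False."""
    for command in unload_commands():
        if dry_run:
            print(" ".join(command))
        else:
            # check=True raises CalledProcessError if a module is still in use.
            subprocess.run(command, check=True)
```

Calling `unload_all()` only prints the four command lines; passing
`dry_run=False` would actually run them.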
Comment 12 Rafael J. Wysocki 2006-10-30 10:09:19 UTC
Does the lack of response mean the patches fix the issue?
Comment 13 Ritesh Raj Sarraf 2006-10-30 22:34:14 UTC
Not necessarily.
I couldn't test the patches because I was ill. Will test it soon (probably on 
the weekend) and update.
Comment 14 Ritesh Raj Sarraf 2006-11-04 06:20:12 UTC
Okay! I tested with the patches you mentioned.
ohci doesn't spit messages now but the network connectivity is still broken 
after hibernate and resume.
I also tried to ifdown and ifup the interface, but it still won't ping, and it
doesn't log any kernel messages either.

I'm attaching the dmesg output which also contains the logs before and after 
hibernation.
I've also attached the lspci output.
Comment 15 Ritesh Raj Sarraf 2006-11-04 06:23:36 UTC
Created attachment 9402
dmesg output
Comment 16 Ritesh Raj Sarraf 2006-11-04 06:24:15 UTC
Created attachment 9403
lspci output
Comment 17 Stefan Richter 2006-11-04 07:27:02 UTC
Thanks for the reports. The last log line from ohci1394 looks promising.

After resume, try the following (each one alone, and if necessary in combination):
- replug the FireWire cable
- unload and reload eth1394
- issue a bus reset with gscanbus (GUI program, menu item "force bus reset") or
with 1394commander (command line program, command "br")

If they are not available as a binary package, here are the sources:
http://gscanbus.berlios.de/
http://www.ict.tuwien.ac.at/ieee1394/opensource.html
1394commander from tuwien is easier to compile. gscanbus requires gtk1 and also
needs a patch from the patch tracker of the project page at berlios if you have
a newer gcc. (Hmm, maybe ieee1394 should get a sysfs attribute to inject bus
resets without those tools...)
Comment 18 Ritesh Raj Sarraf 2006-11-05 12:21:46 UTC
Hi Stefan,
Here's what you'd asked.

Upon cable replug on the server laptop:

ieee1394: Node changed: 0-01:1023 -> 0-00:1023
ieee1394: Node resumed: ID:BUS[0-00:1023]  GUID[354fc0002226c838]
ieee1394: Node changed: 0-00:1023 -> 0-01:1023
ohci1394: fw-host0: AT dma reset ctx=0, aborting transmission
ieee1394: Error parsing configrom for node 0-00:1023
ieee1394: Error parsing configrom for node 0-01:1023
ieee1394: Node suspended: ID:BUS[0-01:1023]  GUID[00c09f00001f8a88]
ieee1394: Node suspended: ID:BUS[0-00:1023]  GUID[354fc0002226c838]
ieee1394: Error parsing configrom for node 0-00:1023
ieee1394: Node resumed: ID:BUS[0-01:1023]  GUID[00c09f00001f8a88]


Upon eth1394 reload on the server laptop:

eth1394: eth1: IEEE-1394 IPv4 over 1394 Ethernet (fw-host0)
ieee1394: ether1394 rx: sender nodeid lookup failure: 0-00:1023
ieee1394: Node resumed: ID:BUS[0-00:1023]  GUID[354fc0002226c838]

Upon gscanbus on the server laptop:
No messages were seen

I also tried the same steps on the other laptop that connects to the server 
laptop and got the same messages for cable replug and eth1394 reload.

On running gscanbus, I did get the following messages:

geeKISSexy:/var/www# gscanbus
1/0x0000fffff0000400: wrong bus info block length
1/0x0000fffff0000400: wrong bus info block length
1/0x0000fffff0000400: wrong bus info block length
1/0x0000fffff0000400: wrong bus info block length
1/0x0000fffff0000400: wrong bus info block length
1/0x0000fffff0000400: wrong bus info block length
1/0x0000fffff0000400: wrong bus info block length
1/0x0000fffff0000400: wrong bus info block length
1/0x0000fffff0000400: wrong bus info block length
1/0x0000fffff0000400: wrong bus info block length
1/0x0000fffff0000400: wrong bus info block length
1/0x0000fffff0000400: wrong bus info block length
1/0x0000fffff0000400: wrong bus info block length
1/0x0000fffff0000400: wrong bus info block length
1/0x0000fffff0000400: wrong bus info block length
1/0x0000fffff0000400: wrong bus info block length
1/0x0000fffff0000400: wrong bus info block length
1/0x0000fffff0000400: wrong bus info block length
1/0x0000fffff0000400: wrong bus info block length
1/0x0000fffff0000400: wrong bus info block length
1/0x0000fffff0000400: wrong bus info block length
1/0x0000fffff0000400: wrong bus info block length


I also observed that when I pinged the client laptop from the server laptop
(the ping failed), the client laptop logged the following messages:

ieee1394: ether1394 rx: sender nodeid lookup failure: 0-01:1023
ieee1394: ether1394 rx: sender nodeid lookup failure: 0-01:1023
ieee1394: ether1394 rx: sender nodeid lookup failure: 0-01:1023
ieee1394: ether1394 rx: sender nodeid lookup failure: 0-01:1023
ieee1394: ether1394 rx: sender nodeid lookup failure: 0-01:1023
ieee1394: ether1394 rx: sender nodeid lookup failure: 0-01:1023
ieee1394: ether1394 rx: sender nodeid lookup failure: 0-01:1023
ieee1394: ether1394 rx: sender nodeid lookup failure: 0-01:1023
ieee1394: ether1394 rx: sender nodeid lookup failure: 0-01:1023
ieee1394: Error parsing configrom for node 0-01:1023
ieee1394: ether1394 rx: sender nodeid lookup failure: 0-01:1023
ieee1394: ether1394 rx: sender nodeid lookup failure: 0-01:1023
ieee1394: ether1394 rx: sender nodeid lookup failure: 0-01:1023
ieee1394: ether1394 rx: sender nodeid lookup failure: 0-01:1023
ieee1394: ether1394 rx: sender nodeid lookup failure: 0-01:1023
ieee1394: ether1394 rx: sender nodeid lookup failure: 0-01:1023
ieee1394: ether1394 rx: sender nodeid lookup failure: 0-01:1023


Hope this helps.

PS: The patches you mentioned were applied only to the server laptop, and only 
the server laptop was suspended. The client laptop's kernel doesn't have the 
patches applied and it wasn't suspended (suspend is broken on it atm :-( )
Comment 19 Stefan Richter 2006-11-05 12:46:56 UTC
Thanks for the tests. Which one is the GUID of the card in the
suspending/resuming server? Is it 00c09f00001f8a88?
Comment 20 Stefan Richter 2006-11-05 12:52:33 UTC
(answering myself) Yes, it is, according to the dmesg output from comment #15.
Comment 21 Stefan Richter 2006-11-05 13:00:14 UTC
Thinking out loud: It appears the remote node cannot read the resumed server's
Configuration ROM. Software on the server should be able to read it (and
apparently is indeed able to do so) because local read and write requests bypass
the hardware.
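
Editorial sketch, not part of the original comment: the address in the
gscanbus messages from comment 18, 0x0000fffff0000400, is where a node's
Configuration ROM starts (CSR register space base plus offset 0x400). The
first quadlet of a general-format ROM carries the bus info block length in its
top byte, which is 4 quadlets for IEEE 1394. Whether gscanbus performs exactly
the check below is an assumption, and the helper names are hypothetical.

```python
CSR_REGISTER_BASE = 0xFFFFF0000000  # start of CSR register space
CONFIG_ROM_OFFSET = 0x400           # Configuration ROM offset within it
CONFIG_ROM_ADDRESS = CSR_REGISTER_BASE + CONFIG_ROM_OFFSET


def parse_rom_header(quadlet):
    """Split the first Configuration ROM quadlet into its three fields."""
    return {
        "info_length": (quadlet >> 24) & 0xFF,  # bus info block length
        "crc_length": (quadlet >> 16) & 0xFF,   # quadlets covered by the CRC
        "crc": quadlet & 0xFFFF,
    }


def has_1394_bus_info_length(quadlet):
    """True when the header announces a 4-quadlet bus info block."""
    return parse_rom_header(quadlet)["info_length"] == 4
```

A remote read that returns garbage or zeros for this quadlet would fail such a
length check, which would be one plausible source of the repeated "wrong bus
info block length" messages.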
Comment 22 Ritesh Raj Sarraf 2006-11-05 13:02:37 UTC
Yes, and here's some details taken from gscanbus for the card on the 
suspending/resuming server.

SelfID Info
-----------
Physical ID: 0
Link active: Yes
Gap Count: 63
PHY Speed: S400
PHY Delay: <=144ns
IRM Capable: Yes
Power Class: None
Port 0: Not connected
Init. reset: Yes

CSR ROM Info
------------
GUID: 0x00C09F00001F8A88
Node Capabilities: 0x000083C0
Vendor ID: 0x0000C09F
Unit Spec ID: 0x0000005E
Unit SW Version: 0x00000001
Model ID: 0x00000000
Nr. Textual Leafes: 1

Vendor:  QUANTA COMPUTER, INC.
Textual Leafes: 
Linux - ohci1394

AV/C Subunits
-------------
N/A


PS: The Port 0 status is "Not connected" because I've removed the cable and
put it on the shelf.
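
Editorial sketch, not part of the original comment: a GUID (EUI-64) is made up
of a 24-bit vendor ID followed by a 40-bit vendor-assigned part, so the
"Vendor ID" line in the output above can be cross-checked against the GUID.
The function name is hypothetical.

```python
def split_guid(guid):
    """Split a 64-bit GUID into (24-bit vendor ID, 40-bit chip ID)."""
    return guid >> 40, guid & 0xFFFFFFFFFF


# GUID of the suspending/resuming server, as shown by gscanbus.
vendor_id, chip_id = split_guid(0x00C09F00001F8A88)
```

Here `vendor_id` comes out as 0x00C09F, the same value gscanbus lists under
"Vendor ID".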
Comment 23 Stefan Richter 2006-12-17 04:03:01 UTC
Status note: The mentioned ohci1394 suspend/resume patches were merged upstream
for Linux 2.6.20-rc1. We still need to implement the proper restoration of the
configuration ROM and updating of the bus generation, and to test and fix all
high-level drivers as far as found necessary.
Comment 24 Stefan Richter 2007-01-07 13:00:06 UTC
Created attachment 10021
patch to let external nodes rediscover the resuming node

This fixes the bug as far as I could test:
  - gscanbus on a remote node is now able to fetch the config ROM of the
    resuming node.
  - sbp2 on the resuming node is now able to log back in to a FireWire
    disk which stayed connected (and powered) during the suspend cycle.
    The SCSI device is the same as before suspend.

I did *not* test eth1394 or any other application yet.

You can get the patch also as part of my patchset v243 or later at
http://me.in-berlin.de/~s5r6/linux1394/updates/.
Comment 25 Stefan Richter 2007-01-07 13:13:50 UTC
PS: All of the previously mentioned patches are required too. If you have an
older kernel which doesn't contain them, take my patchset from me.in-berlin.de.
Comment 26 Stefan Richter 2007-01-08 09:23:45 UTC
Patch committed to linux1394-2.6.git, will send it to Linus after 2.6.20 was
released, i.e. for 2.6.21-rc1. Please reopen this bug entry if eth1394 still
doesn't work after resume.
