Bug 14029

Summary: OOPS: hrtimer_interrupt (raw DV device (DV camera) kills the machine)
Product: Drivers Reporter: Martin Mokrejs (mmokrejs)
Component: IEEE1394 Assignee: drivers_ieee1394
Status: CLOSED UNREPRODUCIBLE    
Severity: normal CC: akpm, stefanr
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.31-rc6-git6 Subsystem:
Regression: No Bisected commit-id:
Attachments: how I resolve the OOPS
rash-rawdv-2.6.30.4.txt
crash-rawdv-2.6.31-rc5-git7.txt
resolved-2.6.30.4.txt
resolved-2.6.31-rc5-git7.txt
stacktrace-2.6.31-rc6-git6.txt
2.6.31-rc6-git6/.config triggering the bug
2.6.31-rc6-git6/.config NOT triggering the bug
live_kernel_stack_traces_without_the_PCMCIA_card_inserted.txt
live_kernel_stack_traces_with_the_PCMCIA_card_inserted.txt
another-stacktrace-2.6.31-rc6-git6.txt
yet-another-stacktrace-2.6.31-rc6-git6.txt
juju-stacktrace-2.6.31-rc6-git6.txt
another-juju-stacktrace-2.6.31-rc6-git6.txt
yet-another-juju-stacktrace-2.6.31-rc6-git6.txt
juju-stacktrace-2.6.31-rc6-git6_with_highres-off.txt
prelast-juju-stacktrace-2.6.31-rc6-git6_with_highres-off.txt
last-juju-stacktrace-2.6.31-rc6-git6_with_highres-off_nousb.txt
attempt45.tar.gz
2.6.31-rc6-git6-also-noacpi-nocpuidle-via-4-to-4-pin-and-pcmcia.txt
via-4-to-4-pin-and-pcmcia2.txt
via-4-to-4-pin-and-pcmcia3.txt
dmesg-via-4-to-4-pin-and-pcmcia4.txt
.config which killed ext3fs
.config with re-enabled fw sbp2 (my current)

Description Martin Mokrejs 2009-08-20 18:21:52 UTC
I want to capture some DV data via FireWire (IEEE 1394a). Kernels 2.6.30.4 and 2.6.31- are both affected. I will attach the stack traces. I haven't resolved a Linux kernel OOPS for a while, so to reassure you I will also attach what I did, showing that the bzImage I booted matches the vmlinux file.

The crashes behaved slightly differently. The one with 2.6.30.4 forced the machine to reboot. The one with 2.6.31-rc5-git7 just locked up the machine, with no SysRq magic keys working. This is highly reproducible.
Comment 1 Martin Mokrejs 2009-08-20 18:23:27 UTC
Created attachment 22792 [details]
how I resolve the OOPS
Comment 2 Martin Mokrejs 2009-08-20 18:25:12 UTC
Created attachment 22793 [details]
rash-rawdv-2.6.30.4.txt

Triggered by the kino(1) program: go into Capture mode. Sometimes that is already enough. If not, start capturing the stream. Mostly the machine locks up hard, with no keyboard LEDs flashing (I think). Once it rebooted the machine.
Comment 3 Martin Mokrejs 2009-08-20 18:25:42 UTC
Created attachment 22794 [details]
crash-rawdv-2.6.31-rc5-git7.txt
Comment 4 Martin Mokrejs 2009-08-20 18:32:33 UTC
Created attachment 22795 [details]
resolved-2.6.30.4.txt
Comment 5 Martin Mokrejs 2009-08-20 18:33:01 UTC
Created attachment 22796 [details]
resolved-2.6.31-rc5-git7.txt
Comment 6 Andrew Morton 2009-08-20 22:55:17 UTC
Stefan, could you please take a peek at this one?

It's a stack overrun so it's hard to tell what caused it.  DV capture
is implicated though.

Martin, if you have CONFIG_4KSTACKS enabled then you could disable that.
This may well "fix" the problem.  We can then use CONFIG_DEBUG_STACKOVERFLOW
and CONFIG_DEBUG_STACK_USAGE to work out where things are going wrong.

Or there might be support for debugging stack usage problems in the new tracer code - we can ask Steven Rostedt and co to help out if so.

Thanks.
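
A minimal sketch of checking the options named above in a kernel .config, assuming a POSIX shell with grep. The function name show_stack_opts is hypothetical and not part of any kernel tooling.

```shell
# show_stack_opts FILE: print the state of the stack-related options
# discussed above, as found in the given kernel .config.
show_stack_opts() {
    for opt in CONFIG_4KSTACKS CONFIG_DEBUG_STACKOVERFLOW CONFIG_DEBUG_STACK_USAGE; do
        # Matches both "CONFIG_X=y" and "# CONFIG_X is not set".
        grep -E "^(# )?${opt}(=| is not set)" "$1" || echo "${opt}: absent"
    done
}
```

It would be called as "show_stack_opts /usr/src/linux/.config", where the path is only an example.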
Comment 7 Martin Mokrejs 2009-08-21 11:13:48 UTC
$ grep CONFIG_4KSTACKS /usr/src/linux-2.6.30.4/.config
# CONFIG_4KSTACKS is not set
$ grep CONFIG_4KSTACKS /usr/src/linux-2.6.31-rc5-git7/.config
# CONFIG_4KSTACKS is not set
$ 

I will recompile the kernel to enable some more debug. So far I had:



$ grep CONFIG_DEBUG /usr/src/linux-2.6.31-rc5-git7/.config
# CONFIG_DEBUG_DRIVER is not set
# CONFIG_DEBUG_DEVRES is not set
CONFIG_DEBUG_FS=y
CONFIG_DEBUG_KERNEL=y
# CONFIG_DEBUG_SHIRQ is not set
# CONFIG_DEBUG_OBJECTS is not set
# CONFIG_DEBUG_SLAB is not set
# CONFIG_DEBUG_KMEMLEAK is not set
# CONFIG_DEBUG_RT_MUTEXES is not set
# CONFIG_DEBUG_SPINLOCK is not set
# CONFIG_DEBUG_MUTEXES is not set
# CONFIG_DEBUG_LOCK_ALLOC is not set
# CONFIG_DEBUG_SPINLOCK_SLEEP is not set
# CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set
# CONFIG_DEBUG_KOBJECT is not set
# CONFIG_DEBUG_HIGHMEM is not set
# CONFIG_DEBUG_BUGVERBOSE is not set
CONFIG_DEBUG_INFO=y
# CONFIG_DEBUG_VM is not set
# CONFIG_DEBUG_VIRTUAL is not set
# CONFIG_DEBUG_WRITECOUNT is not set
# CONFIG_DEBUG_MEMORY_INIT is not set
# CONFIG_DEBUG_LIST is not set
# CONFIG_DEBUG_SG is not set
# CONFIG_DEBUG_NOTIFIERS is not set
# CONFIG_DEBUG_BLOCK_EXT_DEVT is not set
# CONFIG_DEBUG_PAGEALLOC is not set
# CONFIG_DEBUG_STACKOVERFLOW is not set
# CONFIG_DEBUG_STACK_USAGE is not set
# CONFIG_DEBUG_RODATA is not set
# CONFIG_DEBUG_NX_TEST is not set
# CONFIG_DEBUG_BOOT_PARAMS is not set



BTW, please excuse my messy report. I just forgot that I have the stack traces 
resolved automatically, so there is no need to run ksymoops at all.

I just do not understand why ksymoops gave those tons of warnings about the System.map file not matching vmlinux, although, judging from the file timestamps at least, they seemed to come from the same compilation run.
Comment 8 Stefan Richter 2009-08-21 16:38:34 UTC
Martin,
which hardware do you use?  (Camcorder, FireWire controller according to lspci.)
Did it work for you before with an older kernel or in a different hardware combination?
What do "grep OPTIMIZE .config" and "make checkstack | grep 1394" show?
Comment 9 Stefan Richter 2009-08-21 16:48:25 UTC
Furthermore, what if you unload ohci1394 and use firewire-ohci instead?
This requires you to
  - enable the newer FireWire stack in the kernel config,
  - use libraw1394 v2.x
    (latest is best; Gentoo has got it: sys-libs/libraw1394-2.0.4,
    revdep-rebuild if you had libraw1394 v1.x until now)
Comment 10 Stefan Richter 2009-08-21 16:50:19 UTC
> This requires you to
  - chmod the respective device files or upgrade udev rules,
    http://ieee1394.wiki.kernel.org/index.php/Juju_Migration
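
A rough sketch of the driver switch described above, assuming all of the FireWire drivers were built as modules. It only prints the commands and does not run them; the function name print_stack_switch is hypothetical, and the udev or chmod step for the device files is not covered.

```shell
# print_stack_switch: print (do not run) the commands that would swap the
# old ieee1394 drivers for the newer firewire-* drivers.
# Assumption: all of these drivers were built as modules.
print_stack_switch() {
    # High-level drivers first, then the controller driver, then the core.
    for mod in dv1394 video1394 raw1394 eth1394 sbp2 ohci1394 ieee1394; do
        echo "modprobe -r $mod"
    done
    echo "modprobe firewire-ohci"
}
```

Reviewing the printed list before running any of it as root seems the safer choice, since the set of loaded modules differs between systems.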
Comment 11 Martin Mokrejs 2009-08-21 20:36:11 UTC
I am trying to get a stack trace out of 2.6.31-rc6-git6 (current). The machine
once rebooted without sending any messages to the remote ttyS0 console. On the second attempt I copied some 15000 frames through FireWire and then it locked up, but again it did not send a single character to the remote console. "echo test >> /dev/ttyS0" works. The third time I just booted up and pressed Alt+SysRq+p, and the machine locked up with the following sent to the remote console:

test
test
test
SysRq : Emergency Sync
SysRq : Emergency Sync
SysRq : Emergency Sync
SysRq : Kill All Tasks
SysRq : Show Regs

Pid: 0, comm: swapper Not tainted (2.6.31-rc6-git6 #1) System Name
EIP: 0060:[<c11c01ac>] EFLAGS: 00000282 CPU: 0
EIP is at acpi_idle_enter_simple+0x11a/0x145
EAX: c14a5f94 EBX: 00000cab ECX: 00000355 EDX: 00000031
ESI: 00000000 EDI: f70cd448 EBP: c14a5fb4 ESP: c14a5f94
 DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068
CR0: 8005003b CR2: b8067f02 CR3: 3692a000 CR4: 000006d0
DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
DR6: ffff0ff0 DR7: 00000400
Call Trace:
 [<c124cdb8>] cpuidle_idle_call+0x57/0x8b
 [<c1001b6d>] cpu_idle+0x1c/0x33
 [<c12f59dd>] rest_init+0x4d/0x4f
 [<c14a68b8>] start_kernel+0x217/0x21c
 [<c14a62d8>] i386_start_kernel+0x32/0x37
Comment 12 Martin Mokrejs 2009-08-21 21:04:54 UTC
(In reply to comment #8)
> Martin,
> which hardware do you use?  (Camcorder, FireWire controller according to
> lspci.)

The computer is an ASUS L3C/S laptop, P4M-based with the ICH3-M chipset. It has an onboard
FireWire chip, but I do not have a cable with the mini-connector on both ends. So I use a PCMCIA Kouwell 7006 card with USB 2.0 and IEEE 1394 ports.


> Did it work for you before with an older kernel or in a different hardware
> combination?

It did some 4 years ago with the same HW, yes. ;)
Comment 13 Martin Mokrejs 2009-08-21 21:10:42 UTC
Created attachment 22801 [details]
stacktrace-2.6.31-rc6-git6.txt

Finally a stacktrace from 2.6.31-rc6-git6. The machine rebooted itself but spit out some messages. I have manually unloaded dv1394 to ensure it does not interfere. I wonder why it talks about tcp stuff. Currently, it is not physically connected to the ethernet although it is enabled (had to move to a room with a computer with the serial console so no network connection).
Comment 14 Martin Mokrejs 2009-08-21 22:18:50 UTC
It is a Gentoo Linux ~x86 machine. I have installed libdc1394-2.1.0 (the wikipage you mention suggests 2.1.2) and libraw1394-2.0.4. The camera is Sony Handycam DCR-HC18E-PAL.
Comment 15 Martin Mokrejs 2009-08-22 07:45:00 UTC
Created attachment 22803 [details]
2.6.31-rc6-git6/.config triggering the bug
Comment 16 Martin Mokrejs 2009-08-22 07:46:40 UTC
(In reply to comment #8)
> What do "grep OPTIMIZE .config" and "make checkstack | grep 1394" show?

linux-2.6.31-rc6-git6 # grep OPTIMIZE .config
CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y
CONFIG_CC_OPTIMIZE_FOR_SIZE=y
# CONFIG_OPTIMIZE_INLINING is not set
linux-2.6.31-rc6-git6 # make checkstack | grep 1394
0x00002d3b dma_rcv_tasklet [ohci1394]:                  304
0x00000a6c video1394_ioctl [video1394]:                 212
0x00007ed4 hpsb_default_host_entry [ieee1394]:          132
0x00001f05 state_connected [raw1394]:                   120
0x000007a1 ether1394_data_handler [eth1394]:            116
0x00000e43 build_speed_map [ieee1394]:                  108
0x000024cf ohci_iso_recv_task [ohci1394]:               104
0x00000e54 build_speed_map [ieee1394]:                  Dynamic (%eax)
linux-2.6.31-rc6-git6 #
Comment 17 Martin Mokrejs 2009-08-22 07:54:23 UTC
(In reply to comment #7)
> $ grep CONFIG_4KSTACKS /usr/src/linux-2.6.30.4/.config
> # CONFIG_4KSTACKS is not set
> $ grep CONFIG_4KSTACKS /usr/src/linux-2.6.31-rc5-git7/.config
> # CONFIG_4KSTACKS is not set
> $ 
> 
> I will recompile the kernel to enable some more debug. So far I had:

"Inadvertently" I enabled, in addition to many DEBUG variables,

-# CONFIG_CC_STACKPROTECTOR is not set
+CONFIG_CC_STACKPROTECTOR_ALL=y
+CONFIG_CC_STACKPROTECTOR=y

and then both old firewire stack and the juju stack drivers worked (captured some
32 min of data). Probably not surprising for you. ;)
Comment 18 Stefan Richter 2009-08-22 08:42:28 UTC
> I wonder why it talks about tcp stuff.

Something involved in DV reception might have mistakenly written to the kernel
stack, and unrelated timer interrupts trip over it.

(Side note:  Here is a comment from Ingo Molnar on whether it is a stack
overflow or memory corruption:  http://lkml.org/lkml/2009/8/21/311)

> linux-2.6.31-rc6-git6 # make checkstack | grep 1394
> [...]

OK, this is close to what I get on a x86-32 box here too.

> "Inadvertently" I enabled in addition to many DEBUG variables
> [...]
> and then both old firewire stack and the juju stack drivers worked

Could you attach the working .config as well?
Comment 19 Martin Mokrejs 2009-08-22 08:44:38 UTC
Created attachment 22804 [details]
2.6.31-rc6-git6/.config NOT triggering the bug
Comment 20 Martin Mokrejs 2009-08-22 14:39:00 UTC
Created attachment 22806 [details]
live_kernel_stack_traces_without_the_PCMCIA_card_inserted.txt
Comment 21 Martin Mokrejs 2009-08-22 14:39:35 UTC
Created attachment 22807 [details]
live_kernel_stack_traces_with_the_PCMCIA_card_inserted.txt
Comment 22 Martin Mokrejs 2009-08-22 15:09:39 UTC
Created attachment 22808 [details]
another-stacktrace-2.6.31-rc6-git6.txt

So, after disabling the gcc stack protector feature again while keeping stack tracing enabled, I got one reboot with no OOPS on the remote serial console. On the second attempt I got this. This is using the old firewire stack.
Comment 23 Martin Mokrejs 2009-08-22 15:21:36 UTC
Created attachment 22809 [details]
yet-another-stacktrace-2.6.31-rc6-git6.txt

This crash is funny. The mouse has stopped, but the sound card plays the sound recorded on the tape (faster than in real time). The status line of kino(1) says "Waiting for DV 7 ...". Neither unplugging the firewire cable nor pulling out the PCMCIA card changed anything. The computer still makes a sound. Weird. The disk LED does not flash and no Magic Keys work. Is it replaying some files? ;-) It was a fresh, warm boot.
Comment 24 Martin Mokrejs 2009-08-22 15:29:31 UTC
Created attachment 22810 [details]
juju-stacktrace-2.6.31-rc6-git6.txt

The JuJu driver has the same problem. The machine locked up just after going into the "Capture" tab in kino(1). However, it rebooted by itself after a few seconds. Unfortunately, only a partial stack trace.
Comment 25 Martin Mokrejs 2009-08-22 15:34:08 UTC
Created attachment 22811 [details]
another-juju-stacktrace-2.6.31-rc6-git6.txt

Entered the "Capture" tab and saw some video frames, but before I even pressed "capture" the machine locked up and rebooted.
Comment 26 Martin Mokrejs 2009-08-22 15:40:40 UTC
Created attachment 22812 [details]
yet-another-juju-stacktrace-2.6.31-rc6-git6.txt

I passed hpet=off acpi=off on the kernel commandline but it did not help.
Comment 27 Martin Mokrejs 2009-08-22 16:00:43 UTC
Created attachment 22813 [details]
juju-stacktrace-2.6.31-rc6-git6_with_highres-off.txt

I added highres=off but hey, the machine rebooted again. Unfortunately the ftrace_dump_on_oops output is too chatty, so it does not finish over the serial console and is thus useless.
Comment 28 Martin Mokrejs 2009-08-22 16:15:32 UTC
Created attachment 22814 [details]
prelast-juju-stacktrace-2.6.31-rc6-git6_with_highres-off.txt

So here is the pre-last stacktrace when highres=off was passed to kernel with JuJu stack and no ftrace dump enabled.
Comment 29 Martin Mokrejs 2009-08-22 16:19:59 UTC
Created attachment 22815 [details]
last-juju-stacktrace-2.6.31-rc6-git6_with_highres-off_nousb.txt

Although I did rmmod ehci_hcd and uhci_hcd, the machine rebooted, but at least the stack trace does not complain about stack corruption (booted with highres=off as before). Maybe for the first time?

Sorry for all the spam, I will wait for what you say before shooting blindly.
Comment 30 Stefan Richter 2009-08-23 11:37:09 UTC
Since isochronous receive DMA not only through ohci1394+ieee1394+raw1394 but also through firewire-ohci+firewire-core causes panics, it is highly unlikely that there is a driver problem.  Isochronous reception has been newly implemented from scratch in the latter, and it even uses a different DMA mode of the OHCI-1394 chip (different buffer management, different DMA programs).

So, either it's a hardware fault, or some other kernel component is buggy.  Try to simplify your system step by step to eliminate one component after another.

E.g. one starting step could be to use dvgrab (command line DV capture tool) rather than kino.  You can run it first in the same environment as you run kino (full X desktop), then with X shut down, sound drivers unloaded or unbound, networking drivers unloaded or unbound, etc.

Also build test kernels with reduced features.  E.g. remove ACPI power management options.  You can do so also in a bisection style, e.g. remove a whole bundle of features between two steps, and if that helps, bring back a few options.  (Just keep track of what set of features you changed between any two steps.)
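
One way to keep track of the changes between two steps is to list the option lines that differ between two .config files. A minimal sketch, assuming a shell with mktemp, sort, comm and sed available; the function name config_delta is hypothetical:

```shell
# config_delta OLD NEW: list option lines that occur in only one of two
# kernel .config files ("-" = only in OLD, "+" = only in NEW).
config_delta() {
    a=$(mktemp) && b=$(mktemp) || return 1
    # Keep set options and "is not set" markers, drop other comments
    # such as the generation timestamp.
    grep -E '^(CONFIG_|# CONFIG_)' "$1" | LC_ALL=C sort > "$a"
    grep -E '^(CONFIG_|# CONFIG_)' "$2" | LC_ALL=C sort > "$b"
    LC_ALL=C comm -23 "$a" "$b" | sed 's/^/-/'
    LC_ALL=C comm -13 "$a" "$b" | sed 's/^/+/'
    rm -f "$a" "$b"
}
```

Unlike diff -u, this ignores where an option sits in the file, so the output stays short when only a handful of options changed.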

----

Well, you actually already did come up with a narrow bad--good pair of configs, but I can't really make much sense of it:  From comment #19:

$ diff -u "attachment 22803 [details]" "attachment 22804 [details]"
@@ -1,7 +1,7 @@
 #
 # Automatically generated make config: don't edit
 # Linux kernel version: 2.6.31-rc6-git6
-# Fri Aug 21 13:26:12 2009
+# Sat Aug 22 02:23:52 2009
 #
 # CONFIG_64BIT is not set
 CONFIG_X86_32=y
@@ -45,7 +45,6 @@
 CONFIG_GENERIC_HARDIRQS=y
 CONFIG_GENERIC_HARDIRQS_NO__DO_IRQ=y
 CONFIG_GENERIC_IRQ_PROBE=y
-CONFIG_X86_32_LAZY_GS=y
 CONFIG_KTIME_SCALAR=y
 CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"
 CONFIG_CONSTRUCTORS=y

This is auto-selected dependent on CC_STACKPROTECTOR.

@@ -302,7 +301,8 @@
 # CONFIG_X86_PAT is not set
 # CONFIG_EFI is not set
 CONFIG_SECCOMP=y
-# CONFIG_CC_STACKPROTECTOR is not set
+CONFIG_CC_STACKPROTECTOR_ALL=y
+CONFIG_CC_STACKPROTECTOR=y
 # CONFIG_HZ_100 is not set
 # CONFIG_HZ_250 is not set
 # CONFIG_HZ_300 is not set

Huh?  This option is supposed to /cause/ a kernel panic when the stack was corrupted, not to /suppress/ a kernel panic.

@@ -347,7 +347,8 @@
 CONFIG_ACPI_CUSTOM_DSDT_FILE=""
 # CONFIG_ACPI_CUSTOM_DSDT is not set
 CONFIG_ACPI_BLACKLIST_YEAR=0
-# CONFIG_ACPI_DEBUG is not set
+CONFIG_ACPI_DEBUG=y
+# CONFIG_ACPI_DEBUG_FUNC_TRACE is not set
 CONFIG_ACPI_PCI_SLOT=y
 CONFIG_X86_PM_TIMER=y
 CONFIG_ACPI_CONTAINER=y

Could ACPI_DEBUG modify the behaviour or timings of the kernel's ACPI subsystem so that bad things aren't happening anymore, by chance?  (Even though actual debug output is only emitted if a kernel command line option is also set.)

@@ -422,7 +423,7 @@
 # CONFIG_SCx200 is not set
 # CONFIG_OLPC is not set
 CONFIG_PCCARD=y
-# CONFIG_PCMCIA_DEBUG is not set
+CONFIG_PCMCIA_DEBUG=y
 CONFIG_PCMCIA=m
 CONFIG_PCMCIA_LOAD_CIS=y
 CONFIG_PCMCIA_IOCTL=y

Ditto, except s/ACPI/CardBus/.

@@ -814,7 +815,11 @@
 #
 # See the help texts for more information.
 #
-# CONFIG_FIREWIRE is not set
+CONFIG_FIREWIRE=m
+CONFIG_FIREWIRE_OHCI=m
+CONFIG_FIREWIRE_OHCI_DEBUG=y
+CONFIG_FIREWIRE_SBP2=m
+CONFIG_FIREWIRE_NET=m
 CONFIG_IEEE1394=m
 CONFIG_IEEE1394_OHCI1394=m
 # CONFIG_IEEE1394_PCILYNX is not set

As you already found out, enabling these options and even using these drivers instead of the 1394 ones is inconsequential to the bug.

@@ -825,7 +830,7 @@
 CONFIG_IEEE1394_RAWIO=m
 CONFIG_IEEE1394_VIDEO1394=m
 CONFIG_IEEE1394_DV1394=m
-CONFIG_IEEE1394_VERBOSEDEBUG=y
+# CONFIG_IEEE1394_VERBOSEDEBUG is not set
 # CONFIG_I2O is not set
 # CONFIG_MACINTOSH_DRIVERS is not set
 CONFIG_NETDEVICES=y

IEEE1394_VERBOSEDEBUG lets the 1394 stack emit rather massive log messages, hence I presume you ran most of your tests without this option and already found the kernel panicking with as well as without this option.

@@ -2252,7 +2257,8 @@
 # CONFIG_DEBUG_OBJECTS is not set
 # CONFIG_DEBUG_SLAB is not set
 # CONFIG_DEBUG_KMEMLEAK is not set
-# CONFIG_DEBUG_RT_MUTEXES is not set
+CONFIG_DEBUG_RT_MUTEXES=y
+CONFIG_DEBUG_PI_LIST=y
 # CONFIG_RT_MUTEX_TESTER is not set
 # CONFIG_DEBUG_SPINLOCK is not set
 # CONFIG_DEBUG_MUTEXES is not set

Looks inconsequential.

@@ -2263,7 +2269,7 @@
 # CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set
 # CONFIG_DEBUG_KOBJECT is not set
 # CONFIG_DEBUG_HIGHMEM is not set
-# CONFIG_DEBUG_BUGVERBOSE is not set
+CONFIG_DEBUG_BUGVERBOSE=y
 CONFIG_DEBUG_INFO=y
 # CONFIG_DEBUG_VM is not set
 # CONFIG_DEBUG_VIRTUAL is not set

Likely inconsequential, IMO.

@@ -2281,7 +2287,7 @@
 # CONFIG_DEBUG_BLOCK_EXT_DEVT is not set
 # CONFIG_FAULT_INJECTION is not set
 # CONFIG_LATENCYTOP is not set
-# CONFIG_SYSCTL_SYSCALL_CHECK is not set
+CONFIG_SYSCTL_SYSCALL_CHECK=y
 # CONFIG_DEBUG_PAGEALLOC is not set
 CONFIG_USER_STACKTRACE_SUPPORT=y
 CONFIG_HAVE_FUNCTION_TRACER=y

Looks inconsequential.

@@ -2310,6 +2316,7 @@
 # CONFIG_BLK_DEV_IO_TRACE is not set
 # CONFIG_MMIOTRACE is not set
 # CONFIG_PROVIDE_OHCI1394_DMA_INIT is not set
+# CONFIG_FIREWIRE_OHCI_REMOTE_DMA is not set
 # CONFIG_DYNAMIC_DEBUG is not set
 # CONFIG_DMA_API_DEBUG is not set
 # CONFIG_SAMPLES is not set

Inconsequential.
Comment 31 Martin Mokrejs 2009-08-23 12:42:54 UTC
(In reply to comment #30)

> -CONFIG_X86_32_LAZY_GS=y

> This is auto-selected dependent on CC_STACKPROTECTOR.

> @@ -302,7 +301,8 @@
>  # CONFIG_X86_PAT is not set
>  # CONFIG_EFI is not set
>  CONFIG_SECCOMP=y
> -# CONFIG_CC_STACKPROTECTOR is not set
> +CONFIG_CC_STACKPROTECTOR_ALL=y
> +CONFIG_CC_STACKPROTECTOR=y

> Huh?  This option is supposed to /cause/ a kernel panic when the stack was
> corrupted, not to /suppress/ a kernel panic.

Stefan, when I reverted this change and left out the stackprotector stuff, the problem re-appeared, really. I used gcc-3.4.1, but now I also have 4.4.1 and could switch back to 3.4.1 using gcc-config if somebody wants me to (Gentoo Linux box).

The problem(s) is (are) neither related to nor avoided by massive firewire debug messages, ACPI debug, or other debug options.

I cannot interpret the stack traces, but unloading the USB drivers actually made the stack "uncorrupted", unlike in most previous cases. In this context I have to add that I just commented on a Gentoo Linux bug that my uhci-hcd was loaded before ehci-hcd. I was getting a Warning in dmesg(1) that it should be the other way around. Oh, here is what I did to fix the Warning, but I haven't tried the DV test case with it yet ;):

$ tail /etc/modprobe.conf
# make sure this does not happen anymore: "Warning! ehci_hcd should always be loaded before uhci_hcd and ohci_hcd, not after"
# fix from http://bugs.gentoo.org/show_bug.cgi?id=260139
install uhci-hcd /sbin/modprobe ehci-hcd ; /sbin/modprobe -i uhci-hcd 
install ohci-hcd /sbin/modprobe ehci-hcd ; /sbin/modprobe -i ohci-hcd 
$

The "nousb" case was not the first time the stacktrace was not automagically tagged as corrupted (more below). Maybe several bugs co-exist? Please note I did not use "nousb" as a kernel commandline but really unloaded the two drivers on the fly (just in case it would really matter).


Stacks obtained and NOT tagged as corrupted (the name of the crash file is followed by the relevant kernel command line used to boot the machine and by a stack_trace file captured before running kino or entering the "Capture" mode;
ACPI was compiled in statically):

another-juju-stacktrace-2.6.31-rc6-git6.txt

Kernel command line: root=/dev/sda3 console=ttyS0,115200n8 console=tty0 idebus=66 probe_mask=0x3f closcksource=acpi_pm,notsc stacktrace udev


        Depth    Size   Location    (42 entries)
        -----    ----   --------
  0)     2388      32   getnstimeofday+0x53/0xd7
  1)     2356      16   ktime_get_ts+0x25/0x49
  2)     2340      32   ktime_get+0x15/0x33
  3)     2308      16   sched_clock_tick+0x45/0x6e
  4)     2292      24   scheduler_tick+0x19/0xcc
  5)     2268      16   update_process_times+0x3e/0x49
  6)     2252      48   tick_sched_timer+0x154/0x17a
  7)     2204      20   __run_hrtimer+0x3a/0x5d
  8)     2184      44   hrtimer_interrupt+0xe9/0x13d
  9)     2140       8   timer_interrupt+0x1a/0x23
 10)     2132      32   handle_IRQ_event+0x52/0xf1
 11)     2100      16   handle_level_irq+0x5c/0x92
 12)     2084      20   handle_irq+0x40/0x4c
 13)     2064      24   do_IRQ+0x39/0x74
 14)     2040     120   common_interrupt+0x29/0x30
 15)     1920     432   extract_buf+0x6b/0xba
 16)     1488      44   extract_entropy+0x44/0x8a
 17)     1444      16   get_random_bytes+0x1a/0x1e
 18)     1428      16   rt_cache_invalidate+0x1b/0x2a
 19)     1412      24   rt_cache_flush+0x15/0x9c
 20)     1388      32   fib_inetaddr_event+0x19d/0x1aa
 21)     1356      28   notifier_call_chain+0x26/0x48
 22)     1328      36   __blocking_notifier_call_chain+0x3c/0x51
 23)     1292      16   blocking_notifier_call_chain+0x11/0x13
 24)     1276      28   __inet_insert_ifa+0xfe/0x109
 25)     1248      48   inetdev_event+0x142/0x32c
 26)     1200      28   notifier_call_chain+0x26/0x48
 27)     1172      16   raw_notifier_call_chain+0x11/0x13
 28)     1156       8   call_netdevice_notifiers+0x16/0x18
 29)     1148      16   dev_open+0xb5/0xbe
 30)     1132      28   dev_change_flags+0x9b/0x14a
 31)     1104      52   do_setlink+0x247/0x2e3
 32)     1052     244   rtnl_newlink+0x299/0x410
 33)      808      40   rtnetlink_rcv_msg+0x18d/0x1a3
 34)      768      20   netlink_rcv_skb+0x35/0x7d
 35)      748      12   rtnetlink_rcv+0x20/0x27
 36)      736      36   netlink_unicast+0x18f/0x1ef
 37)      700      64   netlink_sendmsg+0x21d/0x22a
 38)      636     216   sock_sendmsg+0xcb/0xe1
 39)      420     292   sys_sendmsg+0x14e/0x19b
 40)      128      48   sys_socketcall+0x147/0x170
 41)       80      80   sysenter_do_call+0x12/0x26




yet-another-juju-stacktrace-2.6.31-rc6-git6.txt

Kernel command line: root=/dev/sda3 console=ttyS0,115200n8 console=tty0 idebus=66 probe_mask=0x3f closcksource=acpi_pm,notsc stacktrace ftrace_dump_on_oops highres=off udev

        Depth    Size   Location    (32 entries)
        -----    ----   --------
  0)     2192      20   check_preempt_wakeup+0x1a/0xbb
  1)     2172      44   try_to_wake_up+0x131/0x145
  2)     2128       8   wake_up_state+0xf/0x11
  3)     2120       8   signal_wake_up+0x22/0x24
  4)     2112      28   complete_signal+0x171/0x189
  5)     2084      32   T.642+0x1a9/0x1bd
  6)     2052      12   __group_send_sig_info+0xf/0x11
  7)     2040      24   group_send_sig_info+0x41/0x4c
  8)     2016     184   send_sigio+0x12c/0x179
  9)     1832      20   __kill_fasync+0x43/0x52
 10)     1812       8   kill_fasync+0x13/0x15
 11)     1804      48   evdev_event+0x6d/0x101
 12)     1756      32   input_pass_event+0x28/0x70
 13)     1724      44   input_handle_event+0x371/0x37a
 14)     1680      28   input_event+0x43/0x51
 15)     1652      56   synaptics_process_byte+0x4ae/0x669
 16)     1596      12   psmouse_handle_byte+0x11/0xc3
 17)     1584      20   psmouse_interrupt+0x20d/0x21c
 18)     1564      16   serio_interrupt+0x1a/0x3a
 19)     1548      48   i8042_interrupt+0x1cd/0x1de
 20)     1500      32   handle_IRQ_event+0x52/0xf1
 21)     1468      16   handle_level_irq+0x5c/0x92
 22)     1452      20   handle_irq+0x40/0x4c
 23)     1432      24   do_IRQ+0x39/0x74
 24)     1408      96   common_interrupt+0x29/0x30
 25)     1312      24   ftrace_call+0x5/0x8
 26)     1288      84   schedule_hrtimeout_range+0xb4/0xf4
 27)     1204      16   poll_schedule_timeout+0x2c/0x43
 28)     1188     752   do_select+0x4a9/0x4f0
 29)      436     316   core_sys_select+0x1bf/0x272
 30)      120      40   sys_select+0x6f/0x8b
 31)       80      80   sysenter_do_call+0x12/0x26




last-juju-stacktrace-2.6.31-rc6-git6_with_highres-off_nousb.txt

Kernel command line: root=/dev/sda3 console=ttyS0,115200n8 console=tty0 idebus=66 probe_mask=0x3f closcksource=acpi_pm,notsc stacktrace highres=off udev

        Depth    Size   Location    (50 entries)
        -----    ----   --------
  0)     2260      24   enqueue_task_fair+0x2b/0xad
  1)     2236      16   T.1248+0x31/0x3c
  2)     2220       8   T.1250+0x20/0x28
  3)     2212      44   try_to_wake_up+0x4f/0x145
  4)     2168       8   default_wake_function+0x10/0x12
  5)     2160      20   autoremove_wake_function+0x14/0x34
  6)     2140      36   __wake_up_common+0x36/0x5d
  7)     2104      20   __wake_up+0x16/0x1f
  8)     2084      32   insert_work+0x7c/0x85
  9)     2052      12   queue_work_on+0x31/0x3b
 10)     2040       8   queue_work+0x13/0x15
 11)     2032       8   kblockd_schedule_work+0x12/0x14
 12)     2024      56   cfq_completed_request+0x269/0x271
 13)     1968      16   elv_completed_request+0x47/0x9e
 14)     1952      20   __blk_put_request+0x25/0x94
 15)     1932      28   blk_finish_request+0x147/0x14e
 16)     1904      20   blk_end_bidi_request+0x2c/0x38
 17)     1884      12   blk_end_request+0xf/0x11
 18)     1872      60   scsi_io_completion+0x164/0x379
 19)     1812      24   scsi_finish_command+0x96/0x9c
 20)     1788      24   scsi_softirq_done+0xdd/0xe5
 21)     1764      24   blk_done_softirq+0x57/0x64
 22)     1740      36   __do_softirq+0x93/0x132
 23)     1704      12   do_softirq+0x2a/0x2f
 24)     1692       8   irq_exit+0x2d/0x2f
 25)     1684      24   do_IRQ+0x61/0x74
 26)     1660     116   common_interrupt+0x29/0x30
 27)     1544     132   generic_make_request+0x365/0x3a0
 28)     1412      48   submit_bio+0x88/0x8f
 29)     1364      32   submit_bh+0xf1/0x111
 30)     1332      12   __bread+0x54/0x82
 31)     1320      36   ext3_get_branch+0x66/0xcf
 32)     1284     192   ext3_get_blocks_handle+0x84/0x6f2
 33)     1092      56   ext3_get_block+0x91/0xcd
 34)     1036     192   do_mpage_readpage+0x28a/0x5db
 35)      844     124   mpage_readpages+0x9c/0xd3
 36)      720      12   ext3_readpages+0x19/0x1b
 37)      708      52   __do_page_cache_readahead+0xd9/0x14f
 38)      656      20   ra_submit+0x1c/0x21
 39)      636      44   filemap_fault+0x15e/0x2cd
 40)      592      68   __do_fault+0x40/0x2ec
 41)      524      68   handle_mm_fault+0x1d9/0x402
 42)      456      48   do_page_fault+0x252/0x268
 43)      408      80   error_code+0x5e/0x64
 44)      328       8   padzero+0x1e/0x2d
 45)      320     144   load_elf_binary+0x5d0/0xfe1
 46)      176      36   search_binary_handler+0x6f/0x1c4
 47)      140      36   do_execve+0x1ae/0x29d
 48)      104      24   sys_execve+0x2b/0x53
 49)       80      80   sysenter_do_call+0x12/0x26







For completeness I am attaching the stack_trace stats when the old firewire stack was used. This should match the attempts in another-stacktrace-2.6.31-rc6-git6.txt or yet-another-stacktrace-2.6.31-rc6-git6.txt, in which ACPI symbols were in the stack dump although the traces were corrupted.

Kernel command line: root=/dev/sda3 console=ttyS0,115200n8 console=tty0 idebus=66 probe_mask=0x3f closcksource=acpi_pm,notsc stacktrace udev

        Depth    Size   Location    (58 entries)
        -----    ----   --------
  0)     2576      24   enqueue_task_fair+0x2b/0xad
  1)     2552      16   T.1248+0x31/0x3c
  2)     2536       8   T.1250+0x20/0x28
  3)     2528      44   try_to_wake_up+0x4f/0x145
  4)     2484       8   wake_up_state+0xf/0x11
  5)     2476       8   signal_wake_up+0x22/0x24
  6)     2468      28   complete_signal+0x171/0x189
  7)     2440      32   T.642+0x1a9/0x1bd
  8)     2408      12   __group_send_sig_info+0xf/0x11
  9)     2396      24   group_send_sig_info+0x41/0x4c
 10)     2372     184   send_sigio+0x12c/0x179
 11)     2188      20   __kill_fasync+0x43/0x52
 12)     2168       8   kill_fasync+0x13/0x15
 13)     2160      48   evdev_event+0x6d/0x101
 14)     2112      32   input_pass_event+0x28/0x70
 15)     2080      44   input_handle_event+0x371/0x37a
 16)     2036      28   input_event+0x43/0x51
 17)     2008      56   synaptics_process_byte+0x4ae/0x669
 18)     1952      12   psmouse_handle_byte+0x11/0xc3
 19)     1940      20   psmouse_interrupt+0x20d/0x21c
 20)     1920      16   serio_interrupt+0x1a/0x3a
 21)     1904      48   i8042_interrupt+0x1cd/0x1de
 22)     1856      32   handle_IRQ_event+0x52/0xf1
 23)     1824      16   handle_level_irq+0x5c/0x92
 24)     1808      20   handle_irq+0x40/0x4c
 25)     1788      24   do_IRQ+0x39/0x74
 26)     1764      80   common_interrupt+0x29/0x30
 27)     1684      44   scsi_request_fn+0x29b/0x378
 28)     1640      12   __blk_run_queue+0x3a/0x5c
 29)     1628      12   blk_run_queue+0x11/0x16
 30)     1616      52   scsi_run_queue+0x1a9/0x1f6
 31)     1564      20   scsi_next_command+0x2d/0x39
 32)     1544      60   scsi_io_completion+0x1a9/0x379
 33)     1484      24   scsi_finish_command+0x96/0x9c
 34)     1460      24   scsi_softirq_done+0xdd/0xe5
 35)     1436      24   blk_done_softirq+0x57/0x64
 36)     1412      36   __do_softirq+0x93/0x132
 37)     1376      12   do_softirq+0x2a/0x2f
 38)     1364       8   irq_exit+0x2d/0x2f
 39)     1356      24   do_IRQ+0x61/0x74
 40)     1332     116   common_interrupt+0x29/0x30
 41)     1216     132   generic_make_request+0x365/0x3a0
 42)     1084      48   submit_bio+0x88/0x8f
 43)     1036      32   submit_bh+0xf1/0x111
 44)     1004      12   __bread+0x54/0x82
 45)      992      36   ext3_get_branch+0x66/0xcf
 46)      956     192   ext3_get_blocks_handle+0x84/0x6f2
 47)      764      56   ext3_get_block+0x91/0xcd
 48)      708     192   do_mpage_readpage+0x28a/0x5db
 49)      516     124   mpage_readpages+0x9c/0xd3
 50)      392      12   ext3_readpages+0x19/0x1b
 51)      380      52   __do_page_cache_readahead+0xd9/0x14f
 52)      328      20   ra_submit+0x1c/0x21
 53)      308      44   filemap_fault+0x15e/0x2cd
 54)      264      68   __do_fault+0x40/0x2ec
 55)      196      68   handle_mm_fault+0x1d9/0x402
 56)      128      48   do_page_fault+0x252/0x268
 57)       80      80   error_code+0x5e/0x64
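
For reading tables like the ones above, a small helper that ranks the frames by size might be handy. This is only a sketch, assuming the Depth/Size/Location layout shown here and a shell with awk, sort and head; the function name top_frames is hypothetical.

```shell
# top_frames [N]: read a stack tracer table on stdin and print the N
# largest frames (default 5) as "size function", largest first.
top_frames() {
    # Data rows look like " 15)     1920     432   extract_buf+0x6b/0xba";
    # the two header lines do not start with "<number>)" and are skipped.
    awk '$1 ~ /^[0-9]+\)$/ { print $3, $4 }' | sort -k1,1nr | head -n "${1:-5}"
}
```

On a running system the input could come from the stack tracer file, e.g. "top_frames 5 < /sys/kernel/debug/tracing/stack_trace", assuming debugfs is mounted at /sys/kernel/debug.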




I have disabled the highres timer now in the kernel but haven't tested that yet.
The "highres=off" on commandline apparently did not disable it or at least not completely, as the "highres" string still appeared in stacktraces.

I thought I did try noapci or acpi=off but according to the dmesg outputs I have stored I did not or it got ignored.

Before I try more, could somebody please let me know whether anything can be gathered from the non-corrupted stack traces?
Comment 32 Stefan Richter 2009-08-23 17:37:17 UTC
I at least have no further idea at this point.
Comment 33 Martin Mokrejs 2009-08-24 09:13:53 UTC
First of all, I dug out an old kernel image, 2.6.29-rc7; the machine still rebooted during the test. Same for 2.6.29.1.

I fixed /etc/modprobe.conf so that the USB drivers get loaded in proper order. The "Warning" message disappeared from dmesg(1) but no improvement in terms of the firewire issue.

I reproduced with "dvgrab fileprefix-". So, I shrunk down the kernel options. At one point I thought that removal of dr+agp+backlight_lcd+fb+parport+i2c did help (I got a "PANIC: doublefault" message but was not on the serial console). However, when re-adding all of these I could not find out which one was the cause; I got crashes until I reached exactly the same .config (as judged by diff(1)). :(

But since that time the machine prints a message to the console (the one on tty at least) that a damaged firewire packet was received.

I shrunk further. I suspected that when yenta and serial share an interrupt it could be a problem. I even dropped serial, mouse, MMIO mapped packets, special init code for Ricoh bridges ... Still, the machine rebooted after spitting out the message about a defective packet. Note that HPET, HIGHRES, watchdog and cpufreq are not in my kernel anymore.


At the moment I boot into initlevel 2, as dropping network, sockets etc. causes my X session to ignore my keyboard. I will attach the .config I have and some dmesg and other info. The problem can even now be triggered by "dvgrab -status". It prints the message about a broken packet and a .dv file is created on the disk, but sometimes the machine does not reboot. As the magic keys worked I obtained some dumps.

I haven't tried with the very last config, but a few steps ago I tried the JuJu stack and got the same results: damaged packet notes and a reboot.


Questions to you: 
1. The machine has an onboard firewire chipset with two mini connectors. As I lost the proper cable I use a PCMCIA card with 2 USB 2.0 and 2 firewire sockets (one is mini sized and the other is the big one, which I use). dmesg(1) outputs show that the firewire controllers are 1.0 and 1.1 spec. Further, their max packet sizes are 1024 and 2048. Could that fool the driver? How can I disable the onboard chip?

2. The damaged packet(s) output I could post as a .jpg file from a camera. dvgrab stopped the camera but although it showed several packet timestamps it spoke only about a single packet?

3. Sometimes in the dmesg(1) output I saw a note:
ieee1394: Current remote IRM is not 1394a-2000 compliant, resetting...
Not always but might give you a clue.

4. The PCMCIA card has an optional input for external power. I do not have it.

5. The camera is powered from a battery; sorry, getting physical access to the power adapter will take months. The firewire cable is good, thick and shielded. I use it for my external firewire sound card without problems. I do have a few more of these cables with the big and mini connectors (but not the one with two mini connectors).
Comment 34 Martin Mokrejs 2009-08-24 09:28:48 UTC
Created attachment 22827 [details]
attempt45.tar.gz

BTW, during my test once I did use "acpi=off" but that did not help. So, at the moment I still do have some acpi stuff in the kernel. Please advise what else I could drop from .config.
Comment 35 Stefan Richter 2009-08-24 16:24:40 UTC
> BTW, during my test once I did use "acpi=off" but that did not help.
> So, at the moment I still do have some acpi stuff in the kernel.
> Please advise what else I could drop from .config.

Could you nevertheless do one last test with a kernel with CONFIG_ACPI=n? Besides that, I don't spot anything interesting to switch off.

You did already test with dvgrab on a text console, X11 shut down, right?
I.e. after /etc/init.d/xdm stop.

> I haven't tried with the very last config but few steps ago I tried
> the JuJu stack and got same results - damaged packet notes and reboot.

Your last tests with minimal kernels bring the DV reception DMA back into the picture as culprit.  Still, what's curious is that (1.) both driver stacks trigger the bug even though they use very different DMA modes of the controller and don't share code in this area, (2.) your hardware/software environment is pretty common yet I don't remember having heard of similar crashes.

> Questions to you: 
> 1. The machine has onboard firewire chipset with two mini connectors.
> As I lost the proper cable I use PCMCIA card with 2 USB2.0 and 2
> firewire sockets (one is the mini sized and the other the big thing -
> which I use). dmesg(1) outputs show that the firewire controllers are
> 1.0 and 1.1 spec. Further, their max packet sizes are 1024 and 2048.

That's quite common.

> Could that fool the driver?

No.  Some people use two or even more controllers regularly and extensively, alone and together, same ones or different ones.  I for one have a box with up to four different 1394 controllers present at a time, among them one out of a few CardBus cards which I have here.

> How can I disable the onboard chip?

If you don't have a BIOS menu for that, then you can either unbind it:
# lspci | grep 1394
# ls /sys/module/ohci1394/drivers/pci\:ohci1394/
# echo -n "0000:04:00.0" > /sys/module/ohci1394/drivers/pci\:ohci1394/unbind
(Use the device ID from the sysfs listing, not the one from lspci.)

Or you can hack linux/drivers/ieee1394/ohci1394.c ohci1394_pci_probe to return early with error if dev->vendor matches the vendor ID (from lspci -nn) of the built-in controller.
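
The sysfs route can also be scripted. The following is only a sketch in Python, assuming the driver directory has the layout shown in the commands above (PCI addresses appear as entries next to an "unbind" file); the function names are made up for illustration, and writing to "unbind" needs root.

```python
import os
import re

# PCI addresses in sysfs look like "0000:04:00.0" (domain:bus:slot.function).
PCI_ADDR = re.compile(r"^[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]$")

def bound_devices(driver_dir):
    """Return the PCI addresses currently bound to the driver at driver_dir."""
    return sorted(name for name in os.listdir(driver_dir)
                  if PCI_ADDR.match(name))

def unbind(driver_dir, pci_addr):
    """Write the address to the driver's 'unbind' file (needs root)."""
    with open(os.path.join(driver_dir, "unbind"), "w") as f:
        f.write(pci_addr)

if __name__ == "__main__":
    # Path as used in the commands above; it is different for firewire-ohci.
    driver_dir = "/sys/module/ohci1394/drivers/pci:ohci1394"
    if os.path.isdir(driver_dir):
        for addr in bound_devices(driver_dir):
            print(addr)
```

The script only lists the bound controllers; pick the address of the built-in one by comparing with lspci before calling unbind().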

BTW, a 1394 controller which is idle will only do a single brief DMA right after initialization (8 or 12 bytes or so).  If driven by firewire-ohci instead of ohci1394, it will also emit an interrupt event every 64 seconds (signaling the wrapping of a bus timer, causing the CPU to update a kernel variable and a chip register).  IOW an idle 1394 controller is almost like absent.

> 2. The damaged packet(s) output I could post as a .jpg file from a
> camera. dvgrab stopped the camera but although it showed several packet
> timestamps it spoke only about a single packet?

I'm not familiar with dvgrab's integrity testing or how it bails out when it detects data errors.

The corrupt packet, as logged by dvgrab, could be because the camcorder sent junk, or because the controller wrote junk into memory, or because the drivers chased wrong buffer pointers after a memory corruption.

BTW, if the camcorder already sent garbage (or the cable corrupted some data), then the controller should still _not_ overwrite random memory (it is programmed by the drivers to DMA into the designated buffers only, not anywhere else); or more likely in such a case it should detect a CRC error and pass just this error status up to the application.

However, if your CardBus card (or the CardBus bridge) is buggy or damaged, then all bets are off.

> 3. Sometimes in the dmesg(1) output I saw a note:
> ieee1394: Current remote IRM is not 1394a-2000 compliant, resetting...
> Not always but might give you a clue.

That's normal when a 1394 node with lesser bus management capabilities is plugged in.  Management of the 1394 bus (as a peer-to-peer or network-like bus) is voluntarily taken over by nodes whose firmware or OS is most capable to do so; or if several of such nodes are present, they have a protocol to select one who does the work.  "IRM" = isochronous resource manager.

> 4. The PCMCIA card has an optional input for external power. I do not
> have it.

No problem.  This power input is only to inject bus power, it does not affect data signals.  Camcorders don't need 1394 bus power.

> 5. The camera is powered from a battery, sorry, getting physical access
> to the power adapter will take months.

Wouldn't make a difference.

> The firewire cable is good, thick and shielded. I use it for my external
> firewire sound card without problems. I do have few more these cables
> with the big and mini connectors (but not the one with two mini
> connectors).

A 4-pin to 4-pin cable (or, 2nd choice, a 4-pin plug to 6-pin socket adapter) would be cool to test how well the built-in controller works in contrast to the CardBus card.  Or maybe you find a local computer shop with friendly service who let you try out another FireWire CardBus card.
Comment 36 Stefan Richter 2009-08-24 16:27:17 UTC
> I use it for my external firewire sound card without problems.

Do you use this sound card on the same notebook and with this CardBus card?
Comment 37 Stefan Richter 2009-08-24 16:27:44 UTC
...and for audio reception?
Comment 38 Martin Mokrejs 2009-08-24 16:56:33 UTC
(In reply to comment #36)
> > I use it for my external firewire sound card without problems.
> 
> Do you use this sound card on the same notebook and with this CardBus card?

It is a Phonic Firefly box with low-latency drivers available only on MS Windows.
The laptop is dual-boot. Under WinXP I recorded a lot using just the same hw
and cable(s).

BTW, I have an external CD-RW/DVD-RW on firewire as well. I could try to burn something on it. I think it used to work under Linux.

Does the dvgrab status really read the tape remotely? I do not think so as the DV-camera is silent. I bet it just talks to the chip over the wire.

Why is the massive firewire debug option useless here? I disabled that in the past as I never saw anything useful. :( Maybe time to retry in this minimalistic setup (yes, am not using X nor framebuffer now when running dvgrab).
Comment 39 Martin Mokrejs 2009-08-24 22:00:40 UTC
Created attachment 22832 [details]
2.6.31-rc6-git6-also-noacpi-nocpuidle-via-4-to-4-pin-and-pcmcia.txt

This crash or the next one for which I did not have a stacktrace killed my ext3fs. Something like "group counts went wrong". Inodes of files I haven't touched for a while were wrong, e.g. some attribute saying it is compressed file although the ext3fs does not allow it, some ugly huge numbers ... A recovery CD-ROM boot corrected the errors but I think I will stop inspecting this.

It only appears that it is related either to the firewire chip on the pcmcia card itself or to the pcmcia driver. The on-motherboard firewire chipset works fine. Kernel crashes can be triggered by "dvgrab -showstatus" as well as by grabbing the stream. There are so many different stack traces that I doubt it makes sense to continue on this.

The "damaged firewire packets" were caused by the fact the camera was in "camera mode" instead of "play/edit mode".

I just do not believe in a hardware issue as the same ports on the pcmcia card work well under win xp sp2.
Comment 40 Martin Mokrejs 2009-08-24 22:32:52 UTC
I have attached the external CD/DVD drive via the firewire cable through the pcmcia card. I managed to copy a whole CD-ROM into /dev/null. The firewire sbp2 driver of the old stack (not JuJu) works. The chip and the cables work as well.

So, it looks like a yenta driver issue.

I don't know why the last stacktrace talks about swap. As I said, I booted into initlevel 2, but yes, at that point swap is really enabled. Have 2 GB RAM and same of swap. I suspect I am hitting too many bugs in a row.
Comment 41 Martin Mokrejs 2009-08-24 22:42:24 UTC
Created attachment 22833 [details]
via-4-to-4-pin-and-pcmcia2.txt
Comment 42 Martin Mokrejs 2009-08-24 22:42:55 UTC
Created attachment 22834 [details]
via-4-to-4-pin-and-pcmcia3.txt
Comment 43 Martin Mokrejs 2009-08-24 22:45:05 UTC
Created attachment 22835 [details]
dmesg-via-4-to-4-pin-and-pcmcia4.txt

A sample dmesg output when the DV camera is connected to the pcmcia firewire port with the kernel without acpi and cpu idle support. I will post .config.
Comment 44 Martin Mokrejs 2009-08-25 06:45:18 UTC
Created attachment 22839 [details]
.config which killed ext3fs
Comment 45 Martin Mokrejs 2009-08-25 06:46:15 UTC
Created attachment 22840 [details]
.config with re-enabled fw sbp2 (my current)
Comment 46 Stefan Richter 2009-08-25 13:24:31 UTC
> The laptop is dual-boot. Under WinXP I recorded a lot using just the
> same hw and cable(s).

This means that the Linux kernel is buggy or that only the Linux OS + apps trigger a hardware fault.

> I have attached the external CD/DVD drive via the firewire cable
> through the pcmcia card. I managed to copy whole cdrom into /dev/null.

(1.) asynchronous I/O (through the old or new 1394 stack) as used with storage devices, (2.) isochronous (e.g. DV) reception through the old stack, (3.) isochronous reception through the new stack all use different DMA modes of the 1394 controller.  Cases (2.) and (3.) are the same though WRT what's coming in over the wire.  To reiterate, cases (2.) and (3.) don't share driver code.

Furthermore, (2.) and (3.) cause a higher IRQ and tasklet load than (1.), because the bulk of (1.) works without involvement of the CPU.  Nevertheless it's not entirely expected that (2.) and (3.) crash while (1.) doesn't.

During (2.) and (3.) there is also some interleaved asynchronous traffic for control and status of the camera, similar in nature to control and status traffic in case of (1.).

----

Anyway, despite all your additional diagnostics I have alas still no idea what causes the bug or why CONFIG_CC_STACKPROTECTOR=y prevents the bug from happening.
Comment 47 Stefan Richter 2009-08-25 14:24:32 UTC
> Kernel crashes can be triggered by "dvgrab -showstatus" as well
> as grabbing the stream.

Do you mean by the former, that the camera not only doesn't play back, it also doesn't transmit from its camera sensor at this time?  (Or more to the point, dvgrab doesn't receive & record frames in this mode?)
Comment 48 Stefan Richter 2009-08-25 14:28:02 UTC
PS:  I ask because one can also capture live video from many camcorders (in contrast to capture from tape), yet those two ways of video capture are the same from the receiving end's POV.  I.e. do you have even live video off but the kernel crashes nevertheless while dvgrab is active?
Comment 49 Martin Mokrejs 2009-08-25 15:21:20 UTC
(In reply to comment #47)
> > Kernel crashes can be triggered by "dvgrab -showstatus" as well
> > as grabbing the stream.
> 
> Do you mean by the former, that the camera not only doesn't play back, it
> also doesn't transmit from its camera sensor at this time?  (Or more to the
> point, dvgrab doesn't receive & record frames in this mode?)

The .dv files were always non-empty. I only bothered to run 'mplayer -vo null' once to ensure the format was readable. Back to your question. I just wanted to say that just the communication over the wire kills the machine. Also, in some tests I had the machine set to camera mode instead of play mode, so the camera did not send real video data but complained, and that is why dvgrab complained about damaged packets. Confusingly enough, dvgrab did create (as I said) non-empty files, readable by mplayer and probably containing just black frames. This matches the picture that 'any' communication with the device kills the machine. It happens within the first, say, 5 seconds after the attempt, mostly within the first second.

I can try to capture a live video. And will test from win xp.

Is there a patch available so that the firewire driver would just receive data while not writing it into memory/file, whatever it really does? It might save my possible future ext3fs problems and might be enough to stress the IRQ lines.

Regarding the ext3 crash: as I rebooted a lot and my fsck is scheduled every 30th mount of the filesystem I can assure you that the filesystem before the crash was clean about 5 kernel crashes ago. Just to emphasize that replaying the journal either resulted in broken inode attributes and metadata or that something wrote directly over the disk area?
Comment 50 Stefan Richter 2009-08-26 15:06:33 UTC
> I can try to capture a live video.

Not necessary.  You already confirmed in comment 49 what I wanted to know:  In all crashing cases, OHCI-1394 IR DMA is active.

> Is there a patch available so that the firewire driver would just
> receive data while not writing it into memory/file, whatever it
> really does? It might save my possible future ext3fs problems

No.  These crashes obviously involve memory corruption (stack corruption in particular), which means that your on-disk data are in danger since the kernel may write random junk.  You merely can reduce the risk somewhat by reducing regular I/O, e.g. by "dvgrab - >/dev/null".  (Trailing "-" in dvgrab's command line means output to stdout.)

However, at this point I don't know what'd be left to test anyway...  Hmm, perhaps one thing if you have the means:  A different CardBus card, as an indicator whether the problem relates more to the card or more to the laptop's chipset, especially its CardBus bridge.  But would it be worth it?  So far the whole thing sounds to me as if nobody else is affected, and that no other kernel developer will pop up with an idea what to fix.  IOW the only realistic way forward for this bug seems to me to be "resolved; wontfix/ worksforme" for now, sorry.
Comment 51 Martin Mokrejs 2009-10-01 16:40:34 UTC
I have enabled ISA BUS support in my kernel. I thought my laptop uses only PCI but it seems at least something got detected:

--- old 2009-10-01 18:38:01.000000000 +0200
+++ new 2009-10-01 18:37:51.000000000 +0200
@@ -1,13 +1,16 @@
 yenta_cardbus 0000:02:07.0: ISA IRQ mask 0x0490, PCI irq 5
-yenta_cardbus 0000:02:07.0: Socket status: 30000820
+yenta_cardbus 0000:02:07.0: Socket status: 30000006
 pci_bus 0000:02: Raising subordinate bus# of parent bus (#02) from #02 to #06
 yenta_cardbus 0000:02:07.0: pcmcia: parent PCI bridge I/O window: 0xa000 - 0xafff
+pcmcia_socket pcmcia_socket0: cs: IO port probe 0xa000-0xafff: clean.
 yenta_cardbus 0000:02:07.0: pcmcia: parent PCI bridge Memory window: 0xd6000000 - 0xd6ffffff
-yenta_cardbus 0000:02:07.0: pcmcia: parent PCI bridge Memory window: 0x80000000 - 0x87ffffff
+yenta_cardbus 0000:02:07.0: pcmcia: parent PCI bridge Memory window: 0x88000000 - 0x8fffffff
 yenta_cardbus 0000:02:07.1: CardBus bridge found [1043:1624]
 yenta_cardbus 0000:02:07.1: ISA IRQ mask 0x0490, PCI irq 11
 yenta_cardbus 0000:02:07.1: Socket status: 30000006
 pci_bus 0000:02: Raising subordinate bus# of parent bus (#02) from #06 to #0a
 yenta_cardbus 0000:02:07.1: pcmcia: parent PCI bridge I/O window: 0xa000 - 0xafff
+pcmcia_socket pcmcia_socket1: cs: IO port probe 0xa000-0xafff: clean.
 yenta_cardbus 0000:02:07.1: pcmcia: parent PCI bridge Memory window: 0xd6000000 - 0xd6ffffff
-yenta_cardbus 0000:02:07.1: pcmcia: parent PCI bridge Memory window: 0x80000000 - 0x87ffffff
+yenta_cardbus 0000:02:07.1: pcmcia: parent PCI bridge Memory window: 0x88000000 - 0x8fffffff


Could it be related and help with the IO stress? I did not bother to test with the device attached.
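
One way to read the two "Socket status" values in the diff above is to decode them bit by bit. The Python sketch below does that with a subset of the CardBus socket-state bits. The bit names and values are an assumption, modelled on the CB_* definitions in drivers/pcmcia/yenta_socket.h, and should be checked against the kernel source in use. The helper names are made up for illustration.

```python
# Assumed subset of the CardBus socket-state bits; verify against the
# CB_* definitions in drivers/pcmcia/yenta_socket.h before relying on it.
SOCKET_STATE_BITS = {
    0x00000001: "CSTSCHG",    # card status change
    0x00000002: "CDETECT1",   # card detect line 1
    0x00000004: "CDETECT2",   # card detect line 2
    0x00000008: "PWRCYCLE",
    0x00000010: "16BITCARD",
    0x00000020: "CBCARD",     # CardBus card detected
    0x00000400: "5VCARD",
    0x00000800: "3VCARD",
    0x10000000: "5VSOCKET",
    0x20000000: "3VSOCKET",
}

def decode_socket_status(status):
    """Return the names of the known bits set in a socket status word."""
    return [name for bit, name in sorted(SOCKET_STATE_BITS.items())
            if status & bit]

def window_mib(start, end):
    """Size in MiB of an inclusive address window."""
    return (end - start + 1) // (1 << 20)

# The two status values and the two memory windows from the diff above.
for value in (0x30000820, 0x30000006):
    print(hex(value), decode_socket_status(value))
print(window_mib(0x80000000, 0x87FFFFFF), window_mib(0x88000000, 0x8FFFFFFF))
```

If the assumed table is right, 30000820 describes a socket holding a 3.3 V CardBus card and 30000006 an empty socket with both card-detect bits set, so the changed status may simply reflect whether the card was inserted at boot. Both memory windows come out at 128 MiB, so the window only moved.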
Comment 52 Stefan Richter 2010-09-11 09:08:30 UTC
Martin, sorry for the long period of silence.  Do you still have the hardware?  Is the issue still present in current kernels?
Comment 53 Stefan Richter 2010-10-09 11:54:27 UTC
Martin responded on 2010-09-21:
>  I have long-standing problem with accessing the bugzilla, I guess they
> have mysql issues there.  I can lend the camera back during X-mas, and the
> PCMCIA card and the laptop I still have.  I will re-try, and probably buy
> another comp with serial port.

For now I think it is justified to close this bug as not reproducible.  I am not aware of directly comparable reports.  You can reopen it if you continue to run into this issue.