This oops is more frequent with trunk as of 2.6.37-rc1-00027-gff8b16d than it had been with v2.6.36.

One of the early hits looked like:

WARNING: at kernel/workqueue.c:1180 worker_enter_idle+0xd6/0xe2()
Hardware name: MS-7642
Modules linked in: tcp_diag inet_diag ipt_addrtype xt_dscp xt_string xt_owner xt_multiport xt_iprange xt_hashlimit xt_DSCP xt_NFQUEUE xt_mark xt_connmark tun snd_pcm_oss snd_mixer_oss snd_usb_audio snd_usbmidi_lib snd_rawmidi tpm_tis tpm ppdev parport_pc tpm_bios parport serio_raw edac_core k10temp pcspkr i2c_piix4 shpchp
Pid: 8061, comm: kworker/0:1 Not tainted 2.6.36-carbon1 #18
Call Trace:
 [<ffffffff81050c60>] warn_slowpath_common+0x85/0x9d
 [<ffffffff81050c92>] warn_slowpath_null+0x1a/0x1c
 [<ffffffff81066861>] worker_enter_idle+0xd6/0xe2
 [<ffffffff81068453>] worker_thread+0x182/0x19b
 [<ffffffff810682d1>] ? worker_thread+0x0/0x19b
 [<ffffffff8106ba81>] kthread+0x82/0x8a
 [<ffffffff8100aae4>] kernel_thread_helper+0x4/0x10
 [<ffffffff8106b9ff>] ? kthread+0x0/0x8a
 [<ffffffff8100aae0>] ? kernel_thread_helper+0x0/0x10
---[ end trace 756b0818a6415dca ]---

A more recent one from ff8b16d is:

WARNING: at kernel/workqueue.c:1169 worker_enter_idle+0xd6/0xe2()
Hardware name: MS-7642
Modules linked in: tcp_diag inet_diag tun snd_pcm_oss snd_mixer_oss snd_usb_audio snd_usbmidi_lib snd_rawmidi tpm_tis tpm tpm_bios ppdev edac_core i2c_piix4 parport_pc pcspkr k10temp serio_raw parport shpchp
Pid: 9863, comm: kworker/3:0 Not tainted 2.6.37-rc1-carbon1-00027-gff8b16d #23
Call Trace:
 [<ffffffff810523f4>] warn_slowpath_common+0x85/0x9d
 [<ffffffff81052426>] warn_slowpath_null+0x1a/0x1c
 [<ffffffff81068537>] worker_enter_idle+0xd6/0xe2
 [<ffffffff8106a25d>] worker_thread+0x18b/0x1a4
 [<ffffffff8106a0d2>] ? worker_thread+0x0/0x1a4
 [<ffffffff8106d8a5>] kthread+0x82/0x8a
 [<ffffffff8100bb24>] kernel_thread_helper+0x4/0x10
 [<ffffffff8106d823>] ? kthread+0x0/0x8a
 [<ffffffff8100bb20>] ? kernel_thread_helper+0x0/0x10
---[ end trace 70e4ee2bc81f41bd ]---

Once it hits, all ptys are locked; anything trying to write to STDOUT is stuck in the kernel (neither SIGKILL nor an RT signal will terminate the processes).

With v2.6.36 (starting sometime during the rc’s) it is infrequent. With last night’s trunk it only takes a few minutes of uptime, even without any significant load on the ptys.

Posting here to make sure it is not lost.

Box is an am3 fam10, kernel is smp preempt.

(I use the -carbon localversion to track changes to .config and which kernel goes with which box. There were no patches applied.)
Yeah, I tried to root-cause it but haven't been successful yet. Is there any way to reproduce it? Under what conditions does this happen and how often? Thanks.
> Yeah, I tried to root-cause it but haven't been successful yet.
> Is there any way to reproduce it? Under what conditions does
> this happen and how often?

With kernel ff8b16d7e15a it occurs on every boot after just a few minutes.

Box is fam10; dist is Gentoo ~amd64; gcc is 4.5.1 with the graphite and lto support compiled in (which requires http://www.cs.unipr.it/ppl/ and http://repo.or.cz/w/cloog-ppl.git). (Gentoo does apply some patches to 4.5.1; I believe all from the 4.5 branch.)

I have "console=ttyS0,115200n8r" in the command line, and agetty(8) also runs on ttyS0. I cannot confirm that it works, though; my laptop’s serial port seems to be kaput. (The fam10 is intended as a headless compute node; I use the laptop as an X server.)

I stuck the current config at http://jhcloos.com/t/fam10.config.xz

It has a bunch of speculative enables and probably a few useless ones; I haven’t confirmed the need for everything in it….
Thanks for the input. Hmmm... I tried to reproduce it but haven't been successful yet. It's weird that the other reported case was also related to tty code. Well, at least you can reproduce it somewhat reliably, so that's good. I'm preparing a debug patch. Will post it soon.
Created attachment 37132: wq-debug.patch

Can you please apply this patch, trigger the problem and attach the full log? Thank you.
Created attachment 37142: wq-debug-1.patch

Oops, forgot something. Please use this one.
I will test wq-debug-1.patch later today or tonight.
Testing proved more difficult than expected.

When I added your patch to what had been the most recent version of the kernel I had previously tested, the problem did not occur. I have CONFIG_LOCALVERSION_AUTO=y in that kernel, though, so adding the patch added '-dirty' to the kernel version; that, of course, caused many files to recompile. This means that the bug may be compiler-specific, or it may be a more typical heisenbug.

I also tried adding it to the tip, but that version didn't work at all. (The serial console bug.)

I want to test a more recent tip, but need to find a safe way to do so which does not require a console. Hefting it between here and the TV room is a drag. (Said TV is the only available monitor, and I lack a dekametre HDMI cable.)

-JimC
Did the kernel trigger any warning messages and stack dumps with the patch applied? If so, can you please attach the full kernel log?
> Did the kernel trigger any warning messages and stack dumps with the patch
> applied? If so, can you please attach full kernel log?

There wasn't a lockup, but looking at the dmesg dumps again I see that it did output some call traces. Compressed attachment to follow.
Created attachment 37602: Dmesg from ff8b16d + the patch from attachment #2

I was unable to get a dmesg from the then-tip with the patch; that crashed too soon because of the serial-console-related bug. Production needs will keep me from testing current tip for a while.
Heh, that's interesting. How does the counter go off without triggering the running state sanity check? Weird. I'll prep another debug patch soon. Thank you.
Created attachment 37672: wq-debug-2.patch

Can you please try this patch and report the kernel warnings? Also, please turn on the printk timestamp. Thank you.
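For reference, printk timestamps can usually be turned on either at build time or from the kernel command line. This is a minimal sketch, assuming a 2.6.3x-era kernel like the ones discussed in this report; the menu location named in the comment is an assumption and may differ between versions.

```
# Build time, in .config
# (typically under Kernel hacking -> "Show timing information on printks"):
CONFIG_PRINTK_TIME=y

# Or at boot, appended to the kernel command line:
printk.time=1
```

With either setting, each line in dmesg is prefixed with a timestamp, which makes it easier to see how long after boot each warning fired.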
I’ll give wq-debug-2.patch a try. It probably won’t be until the weekend, though.
Can you please attach .config? Let's see if I can reproduce it. Thanks.
For some reason, James' message couldn't be committed to bugzilla. Forwarding...

James Cloos wrote:
> > Before trying the last patch I thought I should give unpatched tip
> > another try.
> >
> > As it turned out that was the last commit before the rc3 tag.
> >
> > There were, however, three other variables.
> >
> > The gcc-4.5.1 ebuild was updated with a new patchset (mostly taken from
> > upstream svn, akin to the other dists) and I removed the serial console
> > invocation from the command line. There is little point of it given
> > that the laptop’s serial port refuses to work.
> >
> > I also did a make clean, before rebuilding, just in case.
> >
> > I cannot get this compile to generate the lockup.
> >
> > At this point, I’m leaning towards closing this as a compiler error,
> > but I first should test with serial console, just to be sure.