Bug 22122 - pty lockup at kernel/workqueue.c:1180

Status:	RESOLVED CODE_FIX

Alias:	None

Product:	Drivers
Classification:	Unclassified
Component:	Other
Hardware:	All Linux

Importance:	P1 normal
Assignee:	Tejun Heo

URL:
Keywords:

Depends on:
Blocks:

Reported:	2010-11-04 20:02 UTC by James Cloos
Modified:	2012-08-14 11:37 UTC
CC List:	3 users

See Also:
Kernel Version:	trunk
Subsystem:
Regression:	No
Bisected commit-id:

Attachments
wq-debug.patch (3.10 KB, patch) 2010-11-11 14:25 UTC, Tejun Heo
wq-debug-1.patch (3.43 KB, patch) 2010-11-11 14:28 UTC, Tejun Heo
Dmesg from gff8b16d + the patch from attachment #2 (12.02 KB, application/octet-stream) 2010-11-18 18:49 UTC, James Cloos
wq-debug-2.patch (4.59 KB, patch) 2010-11-19 18:26 UTC, Tejun Heo

Description James Cloos 2010-11-04 20:02:01 UTC

This oops is more frequent with trunk as of 2.6.37-rc1-00027-gff8b16d than it had been with v2.6.36.

One of the early hits looked like:

 WARNING: at kernel/workqueue.c:1180 worker_enter_idle+0xd6/0xe2()
 Hardware name: MS-7642
 Modules linked in: tcp_diag inet_diag ipt_addrtype xt_dscp xt_string xt_owner xt_multiport xt_iprange xt_hashlimit xt_DSCP xt_NFQUEUE xt_mark xt_connmark tun snd_pcm_oss snd_mixer_oss snd_usb_audio snd_usbmidi_lib snd_rawmidi tpm_tis tpm ppdev parport_pc tpm_bios parport serio_raw edac_core k10temp pcspkr i2c_piix4 shpchp
 Pid: 8061, comm: kworker/0:1 Not tainted 2.6.36-carbon1 #18
 Call Trace:
  [<ffffffff81050c60>] warn_slowpath_common+0x85/0x9d
  [<ffffffff81050c92>] warn_slowpath_null+0x1a/0x1c
  [<ffffffff81066861>] worker_enter_idle+0xd6/0xe2
  [<ffffffff81068453>] worker_thread+0x182/0x19b
  [<ffffffff810682d1>] ? worker_thread+0x0/0x19b
  [<ffffffff8106ba81>] kthread+0x82/0x8a
  [<ffffffff8100aae4>] kernel_thread_helper+0x4/0x10
  [<ffffffff8106b9ff>] ? kthread+0x0/0x8a
  [<ffffffff8100aae0>] ? kernel_thread_helper+0x0/0x10
 ---[ end trace 756b0818a6415dca ]---

A more recent one from ff8b16d is:

WARNING: at kernel/workqueue.c:1169 worker_enter_idle+0xd6/0xe2()
Hardware name: MS-7642
Modules linked in: tcp_diag inet_diag tun snd_pcm_oss snd_mixer_oss snd_usb_audio snd_usbmidi_lib snd_rawmidi tpm_tis tpm tpm_bios ppdev edac_core i2c_piix4 parport_pc pcspkr k10temp serio_raw parport shpchp
Pid: 9863, comm: kworker/3:0 Not tainted 2.6.37-rc1-carbon1-00027-gff8b16d #23
Call Trace:
 [<ffffffff810523f4>] warn_slowpath_common+0x85/0x9d
 [<ffffffff81052426>] warn_slowpath_null+0x1a/0x1c
 [<ffffffff81068537>] worker_enter_idle+0xd6/0xe2
 [<ffffffff8106a25d>] worker_thread+0x18b/0x1a4
 [<ffffffff8106a0d2>] ? worker_thread+0x0/0x1a4
 [<ffffffff8106d8a5>] kthread+0x82/0x8a
 [<ffffffff8100bb24>] kernel_thread_helper+0x4/0x10
 [<ffffffff8106d823>] ? kthread+0x0/0x8a
 [<ffffffff8100bb20>] ? kernel_thread_helper+0x0/0x10
---[ end trace 70e4ee2bc81f41bd ]---


Once it hits, all ptys are locked; anything trying to write to STDOUT is stuck in the kernel (neither SIGKILL nor an RT signal will terminate the processes).

With v2.6.36 (starting sometime during the rc’s) it is infrequent.  With last night’s trunk it only takes a few minutes of uptime, even w/o any significant load on the ptys.

Posting here to make sure it is not lost.

Box is an am3 fam10, kernel is smp preempt.

(I use the -carbon localversion to track changes to .config and which kernel goes with which box.  There were no patches applied.)

Comment 1 Tejun Heo 2010-11-09 09:02:41 UTC

Yeah, I tried to root-cause it but haven't been successful yet.  Is there any way to reproduce it?  Under what conditions does this happen and how often?

Thanks.

Comment 2 James Cloos 2010-11-09 21:03:03 UTC

> Yeah, I tried to root-cause it but haven't been successful yet.
> Is there any way to reproduce it?  Under what conditions does
> this happen and how often?

With kernel ff8b16d7e15a it occurs every boot after just a few minutes.

Box is fam10; dist is gentoo ~amd64; gcc is 4.5.1 with the graphite and
lto support compiled in (which requires http://www.cs.unipr.it/ppl/ and
http://repo.or.cz/w/cloog-ppl.git).  (Gentoo does apply some patches to
4.5.1; I believe all from the 4.5 branch.)

I have "console=ttyS0,115200n8r" in the command line, and agetty(8) also
runs on ttyS0.  I cannot confirm that it works, though; my laptop’s
serial port seems to be kaput.  (The fam10 is intended as a headless
compute node; I use the laptop as an X server.)

I stuck the current config at http://jhcloos.com/t/fam10.config.xz

It has a bunch of speculative enables and probably a few useless ones;
I haven’t confirmed the need for everything in it….

Comment 3 Tejun Heo 2010-11-11 13:45:27 UTC

Thanks for the input.  Hmmm... I tried to reproduce it but haven't been successful yet.  It's weird that the other reported case was also related to tty code.  Well, at least you can reproduce it somewhat reliably, so that's good.  I'm preparing a debug patch.  Will post it soon.

Comment 4 Tejun Heo 2010-11-11 14:25:50 UTC

Created attachment 37132
wq-debug.patch

Can you please apply this patch, trigger the problem and attach full log?

Thank you.

Comment 5 Tejun Heo 2010-11-11 14:28:24 UTC

Created attachment 37142
wq-debug-1.patch

Oops, forgot something.  Please use this one.

Comment 6 James Cloos 2010-11-11 17:41:59 UTC

I will test wq-debug-1.patch later today or tonight.

Comment 7 James Cloos 2010-11-17 22:01:30 UTC

Testing proved more difficult than expected.

When I added your patch to what had been the most recent version of the
kernel I had previously tested, the problem did not occur.

I have CONFIG_LOCALVERSION_AUTO=y in that kernel, though, so adding the
patch added '-dirty' to the kernel version; that, of course, caused many
files to recompile.

This means that the bug may be compiler-specific, or it may be a more
typical heisenbug.

I also tried adding it to the tip, but that version didn't work at all.
(The serial console bug.)

I want to test a more recent tip, but need to find a safe way to do so
which does not require a console.  Hefting it between here and the TV
room is a drag.  (Said TV is the only available monitor, and I lack a
dekametre hdmi cable.)

-JimC

Comment 8 Tejun Heo 2010-11-18 06:32:22 UTC

Did the kernel trigger any warning messages and stack dumps with the patch applied?  If so, can you please attach full kernel log?

Comment 9 James Cloos 2010-11-18 18:36:28 UTC

> Did the kernel trigger any warning messages and stack dumps with the patch
> applied?  If so, can you please attach full kernel log?

There wasn't a lockup, but looking at the dmesg dumps again I see that
it did output some call traces.  Compressed attachment to follow.

Comment 10 James Cloos 2010-11-18 18:49:50 UTC

Created attachment 37602
Dmesg from gff8b16d + the patch from attachment #2

I was unable to get a dmesg from the then-tip with the patch; that crashed too soon because of the serial-console-related bug.

Production needs will keep me from testing current tip for a while.

Comment 11 Tejun Heo 2010-11-19 17:21:28 UTC

Heh, that's interesting.  How does the counter go off without triggering the running state sanity check?  Weird.  I'll prep another debug patch soon.  Thank you.

Comment 12 Tejun Heo 2010-11-19 18:26:18 UTC

Created attachment 37672
wq-debug-2.patch

Can you please try this patch and report the kernel warnings?  Also, please turn on the printk timestamp.

Thank you.
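
Editorial aside, not part of the original thread: collecting the warnings requested above can be sketched in a few lines of Python. This is a minimal sketch that assumes each warning block starts with a "WARNING: at file:line func()" line and ends with an "---[ end trace ... ]---" line, as in the traces in the description. The function name extract_warnings and the file name dmesg.txt are hypothetical. Printk timestamps, if enabled, appear as a prefix on each line and do not affect the matching.

```python
import re

# Matches the first line of a warning block, with or without a printk timestamp prefix.
WARN_START = re.compile(r"WARNING: at (?P<file>\S+):(?P<line>\d+) (?P<func>\S+)\(\)")
# Matches the line that closes a warning block.
WARN_END = re.compile(r"---\[ end trace [0-9a-f]+ \]---")


def extract_warnings(log_text):
    """Return a list of (file, line, func, block_lines), one per complete WARNING block.

    A block that never reaches an "end trace" line is dropped, as is a block
    interrupted by the start of another WARNING.
    """
    blocks = []
    current = None
    for raw in log_text.splitlines():
        match = WARN_START.search(raw)
        if match:
            # Start a new block; any unfinished block is discarded.
            current = (match.group("file"), int(match.group("line")),
                       match.group("func"), [raw])
            continue
        if current is not None:
            current[3].append(raw)
            if WARN_END.search(raw):
                blocks.append(current)
                current = None
    return blocks


# Hypothetical usage, assuming the log was saved with "dmesg > dmesg.txt":
#   with open("dmesg.txt") as fh:
#       for file, line, func, lines in extract_warnings(fh.read()):
#           print(f"{file}:{line} in {func} ({len(lines)} log lines)")
```

Grouping by source line makes it easy to see whether the same check fires each time (for example kernel/workqueue.c:1180 versus 1169 in the two traces in the description).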

Comment 13 James Cloos 2010-11-19 19:32:14 UTC

I’ll give wq-debug-2.patch a try. 

It probably won’t be until the weekend, though.

Comment 14 Tejun Heo 2010-11-19 19:54:39 UTC

Can you please attach .config?  Let's see if I can reproduce it.

Thanks.

Comment 15 Tejun Heo 2010-11-24 16:50:39 UTC

For some reason, James' message couldn't be committed to bugzilla.  Forwarding...

James Cloos wrote:
> > Before trying the last patch I thought I should give unpatched tip
> > another try.
> > 
> > As it turned out that was the last commit before the rc3 tag.
> > 
> > There were, however, three other variables.  
> > 
> > The gcc-4.5.1 ebuild was updated with a new patchset (mostly taken from
> > upstream svn, akin to the other dists) and I removed the serial console
> > invocation from the command line.  There is little point of it given
> > that the laptop’s serial port refuses to work.
> > 
> > I also did a make clean, before rebuilding, just in case.
> > 
> > I cannot get this compile to generate the lockup.
> > 
> > At this point, I’m leaning towards closing this as a compiler error,
> > but I first should test with serial console, just to be sure.
