Bug 13841 - 2.6.31-rc4 boot failure
Status: CLOSED CODE_FIX
Product: Other
Classification: Unclassified
Component: Other
Hardware: All Linux
Importance: P1 normal
Assigned To: other_other
Depends on:
Blocks: 13615
Reported: 2009-07-26 21:41 UTC by Rafael J. Wysocki
Modified: 2009-07-29 20:59 UTC (History)

See Also:
Kernel Version: 2.6.31-rc4
Tree: Mainline
Regression: Yes


Description Rafael J. Wysocki 2009-07-26 21:41:58 UTC
Subject    : 2.6.31-rc4 boot failure
Submitter  : Gene Heskett <gene.heskett@verizon.net>
Date       : 2009-07-23 14:12
References : http://marc.info/?l=linux-kernel&m=124835839019906&w=4

This entry is being used for tracking a regression from 2.6.30.  Please don't
close it until the problem is fixed in the mainline.
Comment 1 Rafael J. Wysocki 2009-07-27 22:48:23 UTC
On Monday 27 July 2009, Gene Heskett wrote:
> On Sunday 26 July 2009, Rafael J. Wysocki wrote:
> >This message has been generated automatically as a part of a report
> >of recent regressions.
> >
> >The following bug entry is on the current list of known regressions
> >from 2.6.30.  Please verify if it still should be listed and let me know
> >(either way).
> >
> Yes.  I have nuked the odd stuff in my rc.local file, and rebuilt this kernel 
> with several mods that I thought might be related, but it is still failing in 
> the same manner.  The error messages (apparently from lp:, but long after cups 
> has been started) do NOT make it to the messages log file either, so it's
> totally blown up at that point.
> 
> Note that this is a regression from 2.6.31-rc3, which works fine.  So the
> thing shouldn't be that hard to find.  But in looking over the changelog, 
> nothing obviously reaches out and grabs me.
> 
> I'm not really equipped to do a bisect here either; my git from F10 is at least
> 2 versions old now.  And I'm crippled by a way too small /boot partition which
> can't hold more than 12-14 kernels.  The disk partitioning tool in F10 is
> nothing short of fscking broken IMO.  But Fedora isn't interested in that
> either, because it has existed since at least Fedora 2.  Here, Fedora is on its way
> out; 64 bit Mandriva sure looks nice.  And DiskDrake Just Works(TM).
Comment 2 Gene Heskett 2009-07-28 03:24:28 UTC
I have at least found the triggering script.  It is actually a 2-piece setup, where the 2nd one gets called only if there is something to print.  As it's been working nicely for an extended period of time (at least 1.5 years), I did not initially suspect that it could be the problem.

The 1st script runs as a background daemon, listening to /dev/ttyUSB1, which sits on an extension USB hub that has both a printer and a serial adapter plugged into it.  Both the printer and the old computer on the other side of the FTDI RS-232 adapter are powered down 99% of the time, which is the present condition.

From the meager clues I obtained on this last boot, something has changed in how the kernel or the filesystem handles a query of the

 while [[ -f ${OutFile} ]]

general syntax, returning an I/O error at startup, and the script is looping forever, creating, deleting and re-creating the 25 scratchpad files it uses on a round robin basis.  The disk where /tmp lives is being 'exercised' noticeably.

That may not be where the error really lives, but it's the best I can deduce from the clues I have ATM.  With the script killed, the machine is otherwise happily running 2.6.31-rc4 right now.  There may be an error in the script, but from 2.6.25 or so it has been working flawlessly, until 2.6.31-rc4.  That tends to make me think that something a lot closer to the filesystem core than bash has changed how it works.

More if I get it figured out.

Thanks.
Comment 3 Gene Heskett 2009-07-28 03:25:46 UTC
If some bash script guru wants to look at it, yelp at me.
Comment 4 Gene Heskett 2009-07-28 17:01:20 UTC
Here is the line of my script that triggers the failure, with an echo statement added before it.  It produces the I/O error that kills the script, but only on 2.6.31-rc4; rc3 and many previous kernels over the last 2 years work fine.

From the script, lines 37-38:
----------
echo $InDev
exec 0< ${InDev}        # changes input stream for while read inp below
----------

'inp' is the bash variable that holds the data captured from $InDev when it comes in, and is supposedly empty/null at that point.
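
For anyone unfamiliar with the idiom, here is a minimal sketch of how `exec 0<` feeds a later `while read inp` loop.  It is not the cocod script itself: a temporary file stands in for /dev/ttyUSB1, and the `count` and `last` variables and the use of fd 3 to save stdin are assumptions added purely for illustration.

```shell
#!/bin/bash
# Stand-in for the serial device: a temporary file holding two lines.
# (Assumption for illustration; the original script opens /dev/ttyUSB1.)
InDev=$(mktemp)
printf 'first line\nsecond line\n' > "$InDev"

# Keep a copy of the original stdin on fd 3 so it can be restored later.
exec 3<&0

# From here on, anything that reads stdin in this shell reads from $InDev.
exec 0< "$InDev"

count=0
last=""
# No redirection on the loop itself: it inherits the stdin set up above.
while read -r inp; do
    count=$((count + 1))
    last=$inp
done

# Restore the original stdin, close the spare descriptor, and clean up.
exec 0<&3 3<&-
rm -f "$InDev"

echo "read $count lines, last was: $last"
```

The open happens at the `exec 0<` line, not inside the loop, which would be consistent with the error below being reported against line 38 rather than against the `while read`.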

Started without the daemonizing '&' as a line terminator:
----------------------
[root@coyote libexec]# /usr/local/libexec/cocod /dev/ttyUSB1 Brother-HL2140
/dev/ttyUSB1
/usr/local/libexec/cocod: line 38: /dev/ttyUSB1: Input/output error
----------------------
So $InDev is valid.
The device exists, this listing obtained while booted to 2.6.31-rc4:

[root@coyote amanda]# ls -l /dev/ttyUSB*
crw-rw---- 1 root uucp 188, 0 2009-07-27 23:06 /dev/ttyUSB0
crw-rw---- 1 root uucp 188, 1 2009-07-27 22:28 /dev/ttyUSB1

The exec redirection is also an accepted way to permanently redirect an I/O stream in bash and has been used since forever, but 2.6.31-rc4 broke it.  Whatever changed that is the regression.
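
As a defensive measure, a script could probe the open before committing its stdin to the device.  The sketch below is only a suggestion: `open_input` is a hypothetical helper, not something in the original script, and it does nothing about the underlying kernel problem.  On a real serial device the trial open may itself block or toggle modem control lines, which this sketch does not address.

```shell
#!/bin/bash
# open_input: hypothetical helper, not part of the original cocod script.
# Prints a one-word status saying whether the given path can be opened
# for reading, and returns non-zero when it cannot.
open_input() {
    local dev=$1
    if [ ! -e "$dev" ]; then
        echo "missing"
        return 1
    fi
    # Try the open in a subshell so a failure cannot disturb the file
    # descriptors of the calling shell; the error text is discarded.
    if ! ( exec 3< "$dev" ) 2>/dev/null; then
        echo "unopenable"
        return 2
    fi
    echo "ok"
    return 0
}
```

A caller could then write something like `status=$(open_input "$InDev") || exit 1` ahead of the `exec 0<` line, so that a failed open is logged and stops the daemon instead of leaving it running with an unusable input stream.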

Cheers, Gene.
Comment 5 Alan 2009-07-29 11:02:42 UTC
Dup of 13821 I think
Comment 6 Gene Heskett 2009-07-29 17:50:13 UTC
Ok, then if I do a git bisect reset master; git pull, I should have a fix?
Comment 7 Gene Heskett 2009-07-29 19:24:11 UTC
And indeed it is fixed; it's running now.  Many thanks, this opne I believe, can be closed.
Comment 8 Gene Heskett 2009-07-29 19:25:11 UTC
s/opne/one/g
