Bug 9612 - Hang during boot with NO_HZ
Status: CLOSED CODE_FIX
Alias: None
Product: Timers
Classification: Unclassified
Component: Other
Hardware: All Linux
Importance: P1 normal
Assignee: john stultz
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2007-12-20 19:45 UTC by Nathan Myers
Modified: 2009-01-13 00:54 UTC
CC List: 4 users

See Also:
Kernel Version: 2.6.23.11
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
lspci -vvv output (18.66 KB, text/plain)
2008-03-22 13:24 UTC, Nathan Myers
successful boot using a kernel that sometimes hangs as described (31.47 KB, application/octet-stream)
2008-03-22 13:26 UTC, Nathan Myers
The .config for kernel 2.6.23.11 that sometimes failed (52.38 KB, text/plain)
2008-03-22 13:27 UTC, Nathan Myers
cfs-debug-info.sh results (610.53 KB, text/plain)
2008-07-17 13:55 UTC, Robert Chéramy

Description Nathan Myers 2007-12-20 19:45:51 UTC
Most recent kernel where this bug did not occur: (pre-NO_HZ)
Distribution: Debian
Hardware Environment: Dell Latitude D620 Intel Core2 Duo T7200 2GHz
  Nvidia Quadro NVS 110M / GeForce Go 7300 (rev a1)
Software Environment: Boot from sata, no initramfs, no binary modules
  Kernel built with gcc-4.2.  Normally I patch in tuxonice, although
  that doesn't seem to affect the sequence of boot messages.
Problem Description:
  During boot, messages
  .
    ...
    nvidiafb: CRTC1 analog not found
    i2c-adapter i2c-0: unable to read EDID block
    i2c-adapter i2c-0: unable to read EDID block
    i2c-adapter i2c-0: unable to read EDID block
    i2c-adapter i2c-1: unable to read EDID block
    i2c-adapter i2c-1: unable to read EDID block
    i2c-adapter i2c-1: unable to read EDID block
    Switched to high resolution mode on CPU 1
    Switched to high resolution mode on CPU 0
  .
  and then it hangs, with no response to keyboard actions.  
  Adding nohz=off to the boot line allows boot to continue, leading to 
  .
    IP route cache hash table entries 32768 (order: 5, 131072 bytes)
  .
  after the point where it had frozen.

Steps to reproduce:
  Just boot.  (Behavior was the same in 2.6.22.9 at that time,
  although not often previously.)  

(Is there any point in using acpi_irq_balance any more?)

However, I have been unable to reproduce this failure today
despite unchanged hardware and unchanged kernel images. (I.e.,
the failure is intermittent, on a fairly long time scale.) 

When I boot from power-off today, the order of messages 
produced is very different.  In particular, the "i2c-adapter" 
EDID messages come long after the switch to high resolution 
mode on each CPU.  I don't know why switching to high 
resolution mode is delayed so long in the failed boot.
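
The ordering difference described above can be checked mechanically in a saved boot log. The following is a minimal sketch, assuming the log is available as plain text; the helper names `first_index` and `highres_switch_is_late` are hypothetical, and the sample text simply reuses the messages quoted in this report.

```python
def first_index(lines, needle):
    """Return the index of the first line containing needle, or None."""
    for i, line in enumerate(lines):
        if needle in line:
            return i
    return None


def highres_switch_is_late(log_text):
    """Return True if the first high-resolution switch comes after the
    first EDID error, False if it comes before, and None if either
    message is missing from the log."""
    lines = log_text.splitlines()
    edid = first_index(lines, "unable to read EDID block")
    hres = first_index(lines, "Switched to high resolution mode")
    if edid is None or hres is None:
        return None
    return hres > edid


# Message order as transcribed from the failed boot in this report.
failed_boot = """\
nvidiafb: CRTC1 analog not found
i2c-adapter i2c-0: unable to read EDID block
Switched to high resolution mode on CPU 1
Switched to high resolution mode on CPU 0
"""

# Message order as described for the successful boot (switch comes first).
good_boot = """\
Switched to high resolution mode on CPU 1
Switched to high resolution mode on CPU 0
i2c-adapter i2c-0: unable to read EDID block
"""

print(highres_switch_is_late(failed_boot))
print(highres_switch_is_late(good_boot))
```

This only compares message order; it says nothing about why the switch is delayed.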
Comment 1 Thomas Gleixner 2007-12-21 02:14:29 UTC
Can you please provide dmesg output of a successful boot along with the output of "lspci -vvv" and your .config file?

Thanks,
       tglx
Comment 2 Ingo Molnar 2007-12-21 02:33:45 UTC
> Can you please provide dmesg output of a successful boot along with 
> the output of "lspci -vvv" and your .config file?

a one-stop-shop would be to run this script:

 http://people.redhat.com/mingo/cfs-scheduler/tools/cfs-debug-info.sh

and to attach the resulting file.

[ Note, if you don't have this set in your .config:

    CONFIG_IKCONFIG=y
    CONFIG_IKCONFIG_PROC=y

  then /proc/config.gz won't be included in the output file and you'll 
  need to attach it separately to this bugzilla. ]
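
A quick way to check the two options named in the note before running the script is sketched below. This is an illustration only, not part of cfs-debug-info.sh; the helper names `missing_options` and `read_running_config` are hypothetical, and the sketch assumes that only a value of `y` counts as enabled.

```python
import gzip
import os

# The two options named in the note above.
REQUIRED = ("CONFIG_IKCONFIG", "CONFIG_IKCONFIG_PROC")


def missing_options(config_text, required=REQUIRED):
    """Return the required options that are not set to 'y' in config_text."""
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        # Skip comments such as "# CONFIG_FOO is not set" and blank lines.
        if line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        if value == "y":
            enabled.add(name)
    return [name for name in required if name not in enabled]


def read_running_config(path="/proc/config.gz"):
    """Return the running kernel's config as text, or None if unavailable."""
    if not os.path.exists(path):
        return None
    with gzip.open(path, "rt") as fh:
        return fh.read()


if __name__ == "__main__":
    text = read_running_config()
    if text is None:
        print("/proc/config.gz not present; attach the .config separately")
    else:
        print("missing options:", missing_options(text))
```

`missing_options` can also be run against the text of a build-tree .config file to see whether the next kernel build will expose /proc/config.gz.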
Comment 3 Ingo Molnar 2007-12-21 02:51:16 UTC
> [ Note, if you don't have this set in your .config:
> 
>     CONFIG_IKCONFIG=y
>     CONFIG_IKCONFIG_PROC=y
> 
>   then /proc/config.gz won't be included in the output file and you'll 
>   need to attach it separately to this bugzilla. ]

ok, i've updated cfs-debug-info.sh so that it retrieves the currently 
running kernel package's config automatically. So in theory it should be 
plug & play.
Comment 4 Thomas Gleixner 2008-03-22 04:55:28 UTC
Nathan, 

Any news on this one? Is the problem still present with current mainline?

Thanks,
        tglx
Comment 5 Nathan Myers 2008-03-22 13:22:57 UTC
I'm sorry for going hors de combat on this; [cue usual excuses].
I didn't send output immediately because it seemed important to
verify that it occurred on a vanilla (no TUXONICE) kernel.  
It did, finally, but [cue usual excuses].  

Anyway, the hang has not recurred since I switched to 2.6.24.3.
If you are still interested in details, I can try to produce them.
I imagine there are still users of 2.6.23 who might benefit from 
a fix.

I should mention, too, that the transcription from my notes in 
the original report had an error: there were no lines "i2c-adapter 
i2c-1: unable to read EDID block" (i.e., only "i2c-0").  Also, 
there was a big difference in the boot sequence between failed 
and successful boots: the "switch to high resolution mode" 
attempt really did occur much, much later in the failed boots, 
as suggested by the transcript above with "nvidiafb: CRTC1 analog 
not found".

I'll attach various files next.
Comment 6 Nathan Myers 2008-03-22 13:24:54 UTC
Created attachment 15403
lspci -vvv output

This lspci output is current, as reported from a 2.6.24.3 kernel,
but with no hardware changes since the failures.
Comment 7 Nathan Myers 2008-03-22 13:26:01 UTC
Created attachment 15404
successful boot using a kernel that sometimes hangs as described
Comment 8 Nathan Myers 2008-03-22 13:27:58 UTC
Created attachment 15405
The .config for kernel 2.6.23.11 that sometimes failed
Comment 9 Roland Kletzing 2008-05-14 15:35:31 UTC
I'm really no expert on this and I'm not sure whether commenting here will help, but I would have checked whether

highres=off vs. highres=on 
or
booting with nosmp or maxcpus=1 

makes any difference.

I would also have tried a kernel without those i2c features built in (or without loading the corresponding modules at boot).
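
When comparing boots, it helps to record which of these options a given boot actually used. The sketch below parses a kernel command line string (for example the contents of /proc/cmdline) and reports the options mentioned in this thread. The helper names `parse_cmdline` and `active_workarounds` are hypothetical, and the parser is deliberately simple: it does not handle quoted values containing spaces.

```python
# Boot options mentioned in this bug report.
WORKAROUNDS = ("nohz", "highres", "maxcpus", "nosmp")


def parse_cmdline(cmdline):
    """Split a kernel command line into a dict of name -> value.

    Bare flags such as 'nosmp' or 'ro' map to True.
    """
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else True
    return params


def active_workarounds(cmdline):
    """Return only the options from WORKAROUNDS present on the command line."""
    params = parse_cmdline(cmdline)
    return {name: params[name] for name in WORKAROUNDS if name in params}


# Example with a made-up command line; on a live system the string could be
# read with open("/proc/cmdline").read().
print(active_workarounds("root=/dev/sda1 ro nohz=off maxcpus=1"))
# {'nohz': 'off', 'maxcpus': '1'}
```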
Comment 10 Robert Chéramy 2008-07-17 13:47:34 UTC
I'm experiencing the same problem on my Dell Inspiron 530n.

Passing highres=off or maxcpus=1 to the kernel solves the problem.

Using maxcpus=1 is a pity on a dual core, but I can live with highres=off.

Please also note that (without kernel options) the boot continues after about 10 minutes of freeze (no HDD or screen activity, keyboard dead with the Num Lock LED on).

This problem has been submitted to Red Hat (https://bugzilla.redhat.com/show_bug.cgi?id=241249), where you can see a photo of a frozen boot.

I've also submitted this problem to Debian (http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=481125).
Comment 11 Robert Chéramy 2008-07-17 13:55:55 UTC
Created attachment 16870
cfs-debug-info.sh results
Comment 12 Robert Chéramy 2008-07-26 00:16:28 UTC
Update: I have lately had many boot freezes with the option "highres=off", so this is not a valid workaround.
Comment 13 Robert Chéramy 2008-12-28 23:18:09 UTC
Update: I can't reproduce this bug anymore (I'm using 2.6.26-1-amd64), so it's fixed for me.
