Created attachment 156311 [details] dmesg I notice a random freeze every 5~10 boots with kernel 3.17.1 and 3.17.2, every 5~10 boots a freeze can occur shortly after the load of the kernel, I can see these messages on screen and then nothing, it seems like a freeze : :: running early hook [udev] :: running hook [udev] :: Triggering uvents... sometimes the freeze happens a few seconds after systemd starts ( after the message "mount /home" for example ) I suspect the new loading mode of "intel-ucode" package, maybe it can bring some collision/timing problems ? https://www.archlinux.org/news/changes-to-intel-microcodeupdates/ my configuration : archlinux 64 bits cpu pentium dual core E6800 3.33 Ghz ati radeon HD4650 PCIe ( radeon open source driver, KMS early start ) I use the "kernel mode setting" early start mode, in /etc/mkinitcpio.conf I have this line : MODULES="radeon"
Created attachment 156321 [details] dmesg dmesg
Created attachment 156341 [details] the kernel trace 120 seconds after the freeze in fact the culprit is not intel-ucode, because if I disable the microcode update I still have the random freeze on boot, when the freeze occurs few seconds after the start of systemd a kernel message will occur after 120 seconds of inactivity of systemd, here is the screenshot of this message in attachment
I found the culprit: it was systemd 216.3, If I upgrade to the testing version of systemd ( 217.3 ) the bug is gone, no problems https://www.archlinux.org/packages/testing/x86_64/systemd/
well I was too optimistic, because after further tests I have again the bug
I tested with kernel 3.18rc3 and the bug is still here, every 5~10 boots I have a freeze shortly after the kernel has been loaded, it's not normal, because with kernel 3.16.x I don't have this problem
please I need help, when the bug occurs I can read sometimes these error messages : :: running hook [udev] :: Triggering uevents... worker [53] /devices/pci0000:00/0000:00:1f.2/ata8/host7/target7:0:1/7:0:1:0 is taking a long time worker [60] /devices/pci0000:00/0000:00:1f.2/ata7/host6/target6:0:0/6:0:0:0/block/sr0 is taking a long time If I use 3.16.4 then there is no problems, all is ok, so there is a big problem with kernel 3.17.x and 3.18.x, sometimes a bad initialisation at boot with my PC, probably the harddisks and/or the PCI devices, something that triggers a freeze
Created attachment 156871 [details] error message when freeze occurs with kernel 3.17 and 3.18 error message when freeze occurs at boot with kernel 3.17 and 3.18
another user in archlinux forums has the same problem : https://bbs.archlinux.org/viewtopic.php?pid=1473209#p1473209
I suspect on of these 3 commits to be the culprit for the race condition : commit 1d3408209d43d2e72b5d8dbfb9b60fece399d75b Author: Frederic Weisbecker <fweisbec@gmail.com> Date: Sat Aug 16 18:47:15 2014 +0200 x86: Tell irq work about self IPI support commit 3010279f0fc36f0388872203e63ca49912f648fd upstream. x86 supports irq work self-IPIs when local apic is available. This is partly known on runtime so lets implement arch_irq_work_has_interrupt() accordingly. This should be safely called after setup_arch(). Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> commit 6fd5de08a5337d0d601f6361671813df4f013da9 Author: Frederic Weisbecker <fweisbec@gmail.com> Date: Sat Aug 16 18:37:19 2014 +0200 irq_work: Force raised irq work to run on irq work interrupt commit 76a33061b9323b7fdb220ae5fa116c10833ec22e upstream. The nohz full kick, which restarts the tick when any resource depend on it, can't be executed anywhere given the operation it does on timers. If it is called from the scheduler or timers code, chances are that we run into a deadlock. This is why we run the nohz full kick from an irq work. That way we make sure that the kick runs on a virgin context. However if that's the case when irq work runs in its own dedicated self-ipi, things are different for the big bunch of archs that don't support the self triggered way. In order to support them, irq works are also handled by the timer interrupt as fallback. Now when irq works run on the timer interrupt, the context isn't blank. More precisely, they can run in the context of the hrtimer that runs the tick. But the nohz kick cancels and restarts this hrtimer and cancelling an hrtimer from itself isn't allowed. This is why we run in an endless loop: Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 2 CPU: 2 PID: 7538 Comm: kworker/u8:8 Not tainted 3.16.0+ #34 Workqueue: btrfs-endio-write normal_work_helper [btrfs] ffff880244c06c88 000000001b486fe1 ffff880244c06bf0 ffffffff8a7f1e37 ffffffff8ac52a18 ffff880244c06c78 ffffffff8a7ef928 0000000000000010 ffff880244c06c88 ffff880244c06c20 000000001b486fe1 0000000000000000 Call Trace: <NMI[<ffffffff8a7f1e37>] dump_stack+0x4e/0x7a [<ffffffff8a7ef928>] panic+0xd4/0x207 [<ffffffff8a1450e8>] watchdog_overflow_callback+0x118/0x120 [<ffffffff8a186b0e>] __perf_event_overflow+0xae/0x350 [<ffffffff8a184f80>] ? perf_event_task_disable+0xa0/0xa0 [<ffffffff8a01a4cf>] ? x86_perf_event_set_period+0xbf/0x150 [<ffffffff8a187934>] perf_event_overflow+0x14/0x20 [<ffffffff8a020386>] intel_pmu_handle_irq+0x206/0x410 [<ffffffff8a01937b>] perf_event_nmi_handler+0x2b/0x50 [<ffffffff8a007b72>] nmi_handle+0xd2/0x390 [<ffffffff8a007aa5>] ? nmi_handle+0x5/0x390 [<ffffffff8a0cb7f8>] ? match_held_lock+0x8/0x1b0 [<ffffffff8a008062>] default_do_nmi+0x72/0x1c0 [<ffffffff8a008268>] do_nmi+0xb8/0x100 [<ffffffff8a7ff66a>] end_repeat_nmi+0x1e/0x2e [<ffffffff8a0cb7f8>] ? match_held_lock+0x8/0x1b0 [<ffffffff8a0cb7f8>] ? match_held_lock+0x8/0x1b0 [<ffffffff8a0cb7f8>] ? match_held_lock+0x8/0x1b0 <<EOE><IRQ[<ffffffff8a0ccd2f>] lock_acquired+0xaf/0x450 [<ffffffff8a0f74c5>] ? lock_hrtimer_base.isra.20+0x25/0x50 [<ffffffff8a7fc678>] _raw_spin_lock_irqsave+0x78/0x90 [<ffffffff8a0f74c5>] ? lock_hrtimer_base.isra.20+0x25/0x50 [<ffffffff8a0f74c5>] lock_hrtimer_base.isra.20+0x25/0x50 [<ffffffff8a0f7723>] hrtimer_try_to_cancel+0x33/0x1e0 [<ffffffff8a0f78ea>] hrtimer_cancel+0x1a/0x30 [<ffffffff8a109237>] tick_nohz_restart+0x17/0x90 [<ffffffff8a10a213>] __tick_nohz_full_check+0xc3/0x100 [<ffffffff8a10a25e>] nohz_full_kick_work_func+0xe/0x10 [<ffffffff8a17c884>] irq_work_run_list+0x44/0x70 [<ffffffff8a17c8da>] irq_work_run+0x2a/0x50 [<ffffffff8a0f700b>] update_process_times+0x5b/0x70 [<ffffffff8a109005>] tick_sched_handle.isra.21+0x25/0x60 [<ffffffff8a109b81>] tick_sched_timer+0x41/0x60 [<ffffffff8a0f7aa2>] __run_hrtimer+0x72/0x470 [<ffffffff8a109b40>] ? tick_sched_do_timer+0xb0/0xb0 [<ffffffff8a0f8707>] hrtimer_interrupt+0x117/0x270 [<ffffffff8a034357>] local_apic_timer_interrupt+0x37/0x60 [<ffffffff8a80010f>] smp_apic_timer_interrupt+0x3f/0x50 [<ffffffff8a7fe52f>] apic_timer_interrupt+0x6f/0x80 To fix this we force non-lazy irq works to run on irq work self-IPIs when available. That ability of the arch to trigger irq work self IPIs is available with arch_irq_work_has_interrupt(). Reported-by: Catalin Iacob <iacobcatalin@gmail.com> Reported-by: Dave Jones <davej@redhat.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> commit b677a767bcb3b9fbb4c7f921e5ba9577f1577049 Author: Peter Zijlstra <peterz@infradead.org> Date: Sat Sep 6 15:43:02 2014 +0200 irq_work: Introduce arch_irq_work_has_interrupt() commit c5c38ef3d70377dc504a6a3f611a3ec814bc757b upstream. The nohz full code needs irq work to trigger its own interrupt so that the subsystem can work even when the tick is stopped. Lets introduce arch_irq_work_has_interrupt() that archs can override to tell about their support for this ability. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> https://gitorious.org/opensuse/kernel/commit/1d3408209d43d2e72b5d8dbfb9b60fece399d75b http://gitorious.ti.com/ti-linux-kernel/ti-linux-kernel/commit/76a33061b9323b7fdb220ae5fa116c10833ec22e http://gitorious.ti.com/ti-linux-kernel/ti-linux-kernel/commit/c5c38ef3d70377dc504a6a3f611a3ec814bc757b
when I said "freeze" it's not really a freeze, because my keyboard is still alive ( num, shift, alt and caps keys works ), it's a freeze in terms of "the boot has suddenly stopped", it's seems that the kernel or systemd wait something on boot when the random bug occurs, and they can't go further in the boot process, I have to do a reset with the "reset" button, I'm doing a git bisect between 3.16.7 ( who doesn't have the bug ) and kernel 3.17 ( who has the bug ) in order to find the commit who has introduced this bug
I found the commit who has introduced the bug after doing a git bisect, the first bad commit is : first bad commit: [045065d8a300a37218c548e9aa7becd581c6a0e8] [SCSI] fix qemu boot hang problem http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=045065d8a300a37218c548e9aa7becd581c6a0e8 diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c index 9c44392..ce62e87 100644 --- a/drivers/scsi/scsi_lib.c +++ b/drivers/scsi/scsi_lib.c @@ -1774,7 +1774,7 @@ static void scsi_request_fn(struct request_queue *q) blk_requeue_request(q, req); atomic_dec(&sdev->device_busy); out_delay: - if (atomic_read(&sdev->device_busy) && !scsi_device_blocked(sdev)) + if (!atomic_read(&sdev->device_busy) && !scsi_device_blocked(sdev)) blk_delay_queue(q, SCSI_QUEUE_DELAY); } the author of this faulty commit is Guenter Roeck <linux@roeck-us.net> , maybe someone should contact him, in order to warn him that his patch triggers a bug on some PC configuration ( like my PC )
Created attachment 157211 [details] the log of the git bisect the log of the git bisect
Created attachment 157221 [details] revert the faulty commit 045065d8a300a37218c548e9aa7becd581c6a0e8 I create a patch who solves the bug, this patch reverts the faulty commit 045065d8a300a37218c548e9aa7becd581c6a0e8
Created attachment 157621 [details] second bisect second bisect
Created attachment 157821 [details] increase the time in line 1779 of scsi_lib.c a new patch who solves the problem by increasing the time ( 40 ms instead of 3 ms ) in line 1779 for blk_delay_queue() function
I solved the mystery, I found that the element who triggers the bug ( random hang at boot with kernel 3.17 and 3.18 ) is the combination of 3 elements : - the use of a SATA DVD burner ( Liteon iHAS124 C ) on a ICH7 Sata controler - the use of a gigabyte motherboard GA-P31-DSL3 ( bios F10A, ICH7 controler, intel P31 chipset ) - commit 74665016086615bbaa3fa6f83af410a0a4e029ee ( scsi: convert host_busy to atomic_t ) If I connect this Sata DVD burner and a sata harddisk by using the SATA ports of the motherboard then the bug will occur ( but the bug will occur only on kernels 3.17 and 3.18, there is no problems with older kernels, and no problems with Windows 7 ) If I disconnect the SATA DVD burner then the bug is gone, no problems with kernels 3.17 and 3.18, And if I connect the SATA DVD burner on my JMicron SATA/IDE PCIe card then there is no problem, no bugs, this is a perfect workaround for my problem, because I can use kernel 3.17/3.18 without problem with this configuration. But I don't know which element I should blame, my gigabyte motherboard ? ( faulty bios ? ) The use of "atomic_t" in scsi source code ? ( innapropriate way to handle SATA devices, it breaks compatibility with some PC configurations ? )
Created attachment 158411 [details] test patch test patch made by a linux developer who solves the bug
The patch in attachment 157821 [details] (increase delay time) solves the issue for me, too: no more hung tasks for some 20 boots. My hardware is a bit different: i7 Haswell (with early microcode upload), MSI Z87-G43 board with SSD, HD and DVD attached to the SATA ports. One common thing may be the JMicron JMB363 SATA/IDE controller on PCIe, used for legacy drives. Hung tasks appeared for the HD attached to the MB and more frequently for IDE DVD drives attached to the JMicron controller. In the first case the file systems on the HD did not show up and mount hung. In the latter case, typically cdrom_id triggered the watchdog message and the start of the kde session did not complete. I had issues with older drives connected to the JMicron controller before. So this is the major suspect for me. Hung tasks only appeared with kernel 3.17, though. I could also try the patch in attachment 158411 [details] if it is considered to be less of a workaround.
(In reply to gdpt from comment #18) > I could also try the patch in attachment 158411 [details] if > it is considered to be less of a workaround. yes you should try this patch : https://bugzilla.kernel.org/attachment.cgi?id=158411 the bug seems triggered if a slow device ( DVD burner ) is mixed with a fast device ( hard disk ) on the same IDE/SATA channel, the bug also seems triggered if you use the "IDE emulation mode" option in your bios settings instead of the "AHCI" mode, and if a slow device ( DVD burner ) is connected, so you can try a test : remove your IDE DVD burner, without the DVD burner the bug should disapear
The second patch also seems to solve the issue (>10 boots). Your theory fits my original set-up with different IDE drives on the JMicron controller. Later, I must have set the MB to IDE mode in an attempt to stabilize the system, causing new hangs with the HD and a DVD burner.
It seems that my setup is also affected by this bug. I've go GA-MA790X-DS4 motherboard with DVD burner and HD mixed on ATA (not SATA) channel. Patch from https://bugzilla.kernel.org/attachment.cgi?id=158411 fixes locks also for me (>10 boots).
Who's that patch from? And can somebody dust it off and post it w/ proper description?
(In reply to Tejun Heo from comment #22) > Who's that patch from? And can somebody dust it off and post it w/ proper > description? this patch comes from a linux developper ( Christoph Hellwig ) https://lkml.org/lkml/2014/11/20/25 https://bugzilla.kernel.org/attachment.cgi?id=158411
I also have this issue, although I do not have a intel chipset motherboard, but rather the Gigabyte 990fxa-UD3. I also have a dvd burner attached to my SATA port. I will try removing the DVD burner and reporting back.
Removing the DVD burner from the MARVEL SATA connection did not change the issue for me. However switching my MARVEL SATA ports from AHCI back to IDE seems to immediately cause my boot process to be faster and I have yet to run into a boot lockup.