Bug 100241 - NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s!
Summary: NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s!
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: Watchdog
Hardware: All Linux
Importance: P1 normal
Assignee: drivers_watchdog@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-06-21 11:21 UTC by Mikhail
Modified: 2016-05-30 22:12 UTC
CC List: 2 users

See Also:
Kernel Version: 4.0.5-300.fc22.x86_64+debug
Subsystem:
Regression: No
Bisected commit-id:


Attachments
dmesg (467.42 KB, application/octet-stream)
2015-06-21 11:21 UTC, Mikhail

Description Mikhail 2015-06-21 11:21:03 UTC
Created attachment 180501
dmesg

[   52.080830] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [teamviewerd:1600]
[   52.080831] Modules linked in: xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_nat ebtable_broute bridge ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw bnep hid_logitech_hidpp hid_logitech_dj btusb bluetooth rfkill ppdev iTCO_wdt iTCO_vendor_support intel_rapl iosf_mbi btrfs x86_pkg_temp_thermal coretemp vfat xor kvm_intel raid6_pq kvm crct10dif_pclmul fat crc32_pclmul crc32c_intel ghash_clmulni_intel snd_hda_codec_realtek snd_hda_codec_hdmi serio_raw snd_hda_codec_generic
[   52.080852]  snd_hda_codec_ca0132 i2c_i801 snd_hda_intel snd_hda_controller snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm tpm_infineon parport_pc snd_timer tpm_tis parport mei_me snd tpm lpc_ich mei soundcore shpchp mfd_core nfsd auth_rpcgss nfs_acl lockd binfmt_misc grace sunrpc i915 8021q i2c_algo_bit garp drm_kms_helper stp llc mrp drm r8169 mii video uas usb_storage
[   52.080867] irq event stamp: 1075866
[   52.080868] hardirqs last  enabled at (1075865): [<ffffffff8188cdac>] restore_args+0x0/0x30
[   52.080871] hardirqs last disabled at (1075866): [<ffffffff8188d11d>] apic_timer_interrupt+0x6d/0x80
[   52.080873] softirqs last  enabled at (1075864): [<ffffffff810b2d03>] __do_softirq+0x3b3/0x670
[   52.080875] softirqs last disabled at (1075859): [<ffffffff810b3245>] irq_exit+0x145/0x150
[   52.080878] CPU: 1 PID: 1600 Comm: teamviewerd Not tainted 4.0.5-300.fc22.x86_64+debug #1
[   52.080879] Hardware name: Gigabyte Technology Co., Ltd. Z87M-D3H/Z87M-D3H, BIOS F11 08/12/2014
[   52.080880] task: ffff8807cc698000 ti: ffff8807cc6a0000 task.ti: ffff8807cc6a0000
[   52.080881] RIP: 0010:[<ffffffff81435105>]  [<ffffffff81435105>] copy_user_enhanced_fast_string+0x5/0x10
[   52.080884] RSP: 0000:ffff8807cc6a3c00  EFLAGS: 00010286
[   52.080885] RAX: 00000000f77a2000 RBX: ffffffff8188cdac RCX: 00000000000007bc
[   52.080885] RDX: 0000000000001000 RSI: 00000000f77a2844 RDI: ffff8807c8895844
[   52.080886] RBP: ffff8807cc6a3c38 R08: ffff880000000000 R09: 0000000000000000
[   52.080887] R10: ffff8807cc698000 R11: ffffea001f222540 R12: ffff8807cc6a3b78
[   52.080887] R13: ffff8807cc698000 R14: ffff8807cc6a0000 R15: ffff8807cc698000
[   52.080888] FS:  0000000000000000(0000) GS:ffff8807fe600000(0063) knlGS:00000000f73cb740
[   52.080889] CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
[   52.080890] CR2: 00000000f5f20000 CR3: 00000007e7dd3000 CR4: 00000000001407e0
[   52.080891] Stack:
[   52.080891]  ffffffff8143abca ffff8807cc6a3c38 0000000000001000 0000000000316000
[   52.080893]  ffff8807cc6a3e40 0000000000000000 ffff8807d7fdabc0 ffff8807cc6a3ce8
[   52.080895]  ffffffff811e6be9 ffff8807cc6a3ca8 ffff8807cc698000 ffffffff818881e4
[   52.080896] Call Trace:
[   52.080899]  [<ffffffff8143abca>] ? iov_iter_copy_from_user_atomic+0x8a/0x260
[   52.080901]  [<ffffffff811e6be9>] generic_perform_write+0xe9/0x1f0
[   52.080904]  [<ffffffff818881e4>] ? mutex_lock_nested+0x2b4/0x460
[   52.080906]  [<ffffffff811e9e5f>] __generic_file_write_iter+0x2df/0x380
[   52.080908]  [<ffffffff81311af4>] ext4_file_write_iter+0x174/0x490
[   52.080911]  [<ffffffff81027b8d>] ? native_sched_clock+0x2d/0xa0
[   52.080914]  [<ffffffff81276ecb>] ? vfs_write+0x1ab/0x210
[   52.080916]  [<ffffffff81276561>] new_sync_write+0x91/0xd0
[   52.080917]  [<ffffffff81276dd7>] vfs_write+0xb7/0x210
[   52.080919]  [<ffffffff81277aec>] SyS_write+0x5c/0xd0
[   52.080921]  [<ffffffff8188eee9>] ia32_do_call+0x13/0x13
[   52.080922] Code: 48 ff c6 48 ff c7 ff c9 75 f2 89 d1 c1 e9 03 83 e2 07 f3 48 a5 89 d1 f3 a4 31 c0 66 66 90 c3 0f 1f 80 00 00 00 00 66 66 90 89 d1 <f3> a4 31 c0 66 66 90 c3 90 90 90 66 66 90 83 fa 08 0f 82 95 00 


etc
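
For triage, the soft lockup header lines can be pulled out of a saved dmesg dump mechanically. The following is a minimal sketch, not part of the original report: the function name find_soft_lockups and the regular expression are illustrative assumptions, written against the line format shown in the trace above.

```python
import re

# Matches header lines such as:
# [   52.080830] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [teamviewerd:1600]
# The pattern is an assumption based on the excerpt in this report.
SOFT_LOCKUP_RE = re.compile(
    r"\[\s*(?P<timestamp>\d+\.\d+)\]\s+NMI watchdog: BUG: soft lockup - "
    r"CPU#(?P<cpu>\d+) stuck for (?P<seconds>\d+)s! "
    r"\[(?P<task>.+):(?P<pid>\d+)\]"
)


def find_soft_lockups(log_text):
    """Return one dict per soft lockup header found in a dmesg dump."""
    reports = []
    for line in log_text.splitlines():
        match = SOFT_LOCKUP_RE.search(line)
        if match is None:
            continue
        reports.append({
            "timestamp": float(match.group("timestamp")),
            "cpu": int(match.group("cpu")),
            "stuck_seconds": int(match.group("seconds")),
            # Task names may contain colons (e.g. kworker/0:1), so the PID
            # is taken from after the last colon inside the brackets.
            "task": match.group("task"),
            "pid": int(match.group("pid")),
        })
    return reports
```

Applied to the first line of the trace above, this reports CPU 1, stuck for 22 seconds, task teamviewerd, PID 1600.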
Comment 1 João M. S. Silva 2015-08-05 16:35:06 UTC
I get this error on A10- and A20-OLinuXino-LIME boards.

The system eventually boots after a few minutes.

Is there a workaround?
Comment 2 João M. S. Silva 2015-08-05 16:37:30 UTC
[   36.159857] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [kworker/0:1:19]
[   37.149858] INFO: rcu_sched detected stalls on CPUs/tasks: {} (detected by 0, t=2102 jiffies, g=878, c=877, q=51)
[   37.160158] All QSes seen, last rcu_sched kthread activity 2102 (-26288--28390), jiffies_till_next_fqs=1, root ->qsmask 0x0
[   37.171735] rcu_sched kthread starved for 2102 jiffies!
Comment 3 [account disabled by administrator] 2016-04-16 21:39:02 UTC
Greetings João, can we check whether both calls to mark_runtime_wc in the function substream_alloc_pages are reached? Here is the function with one debug line added:
static int substream_alloc_pages(struct azx *chip,
                                  struct snd_pcm_substream *substream,
                                  size_t size)
{
        struct azx_dev *azx_dev = get_azx_dev(substream);
        int ret;

        mark_runtime_wc(chip, azx_dev, substream, false);
        ret = snd_pcm_lib_malloc_pages(substream, size);
        if (ret < 0)
                return ret;
        /* Add this line: */
        printk_ratelimited(KERN_CRIT "Reached\n");
        mark_runtime_wc(chip, azx_dev, substream, true);
        return 0;
}
The file to change, relative to the root of the kernel source directory, is sound/pci/hda/hda_intel.c.
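
After booting the patched kernel, one way to see whether the debug line fired is to look for the marker text in the dmesg output. The sketch below is an illustration only: the helper name count_marker is hypothetical, and it assumes each log line optionally starts with a "[seconds.fraction]" timestamp, as in the excerpts in this report.

```python
import re

# Optional leading dmesg timestamp, e.g. "[   52.080830] ".
TIMESTAMP_RE = re.compile(r"^\[\s*\d+\.\d+\]\s*")


def count_marker(log_text, marker="Reached"):
    """Count log lines whose whole message is exactly the marker text."""
    count = 0
    for line in log_text.splitlines():
        # Strip the timestamp (if any) so only the message is compared.
        message = TIMESTAMP_RE.sub("", line, count=1).strip()
        if message == marker:
            count += 1
    return count
```

A count of zero would suggest the code never got past snd_pcm_lib_malloc_pages. Because the message is rate limited, the count may be lower than the number of times the line was actually executed.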
Comment 4 João M. S. Silva 2016-04-17 21:32:54 UTC
Hi Bastien,

I'm not sure how to try your suggestion. Do I need to build the kernel with a modified source file?

Anyway, I'm now away from home, where I have the Single Board Computers in which I got this error. I'm returning home on the 20th of May.
Comment 5 [account disabled by administrator] 2016-04-17 22:25:54 UTC
Yes, that's exactly what I was asking. If you need more help with it I can try, but I assume you're OK with that. I don't know whether you're away for work or for vacation. If it's vacation, we can wait until after the 20th; otherwise we should continue fixing your issue. Furthermore, you actually have two warnings in your dmesg output, and the one you pasted was not the one causing a lockup. If you're unsure which output to paste into your bug report, just upload the dmesg and lsmod output, as these two are generally the most useful for lockups or kernel panics.
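
Capturing both outputs into files for upload can be scripted. This is a minimal sketch under stated assumptions: the names run_command and collect_reports and the output file names are made up for illustration, and it assumes the dmesg and lsmod commands are on the PATH. On some systems reading the kernel log requires root.

```python
import subprocess
from pathlib import Path


def run_command(argv):
    """Run a command and return its standard output as text."""
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    return result.stdout


def collect_reports(out_dir, runner=run_command):
    """Save the dmesg and lsmod output to text files inside out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for name, argv in (("dmesg.txt", ["dmesg"]), ("lsmod.txt", ["lsmod"])):
        path = out_dir / name
        path.write_text(runner(argv))
        written.append(path)
    return written
```

Calling collect_reports("bug-report") on the affected machine would leave two files in that directory that can be attached to the bug. The runner parameter exists only so the file handling can be exercised without running the real commands.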
Comment 6 João M. S. Silva 2016-04-18 21:04:57 UTC
I'm away for work. You just caught me the day I was leaving home.

When I get back I can try to replicate the error and get the full logs, and also try kernel compilation.

Please advise, or proceed as you think best.
Comment 7 João M. S. Silva 2016-05-20 18:18:40 UTC
Hi Bastien,

I'm now home, with access to the A20-OLinuXino-LIME.

This is the way I build the GNU/Linux image for the board:

  https://eewiki.net/display/linuxonarm/A20-OLinuXino-LIME

However, these instructions have changed since the last time I built the image, so after I modified sound/pci/hda/hda_intel.c according to your instructions I ran:

  ./build_kernel.sh

but this script must contain some sort of self-update, because it is compiling lots of files, not just the one I modified.

Can I make this test with the current kernel being used in the above page?

https://eewiki.net/display/linuxonarm/A20-OLinuXino-LIME#A20-OLinuXino-LIME-LinuxKernel

I.e. can I test this with any kernel version?

Thanks.
Comment 8 [account disabled by administrator] 2016-05-21 03:01:31 UTC
It seems to work for newer kernels, so try the instructions with my change and report back if there are any issues.
Comment 9 João M. S. Silva 2016-05-22 20:20:39 UTC
I'm having trouble creating a new image for this board with the instructions from eewiki.net. The image does not seem to boot: no messages are output through the serial console. I'm trying to figure this out, but it may take a few days.
Comment 10 [account disabled by administrator] 2016-05-22 21:29:49 UTC
Is this the same kernel version you used before with the other instructions, or a newer kernel version? If the kernel version is newer, try using git bisect to find the bad commit.
Comment 11 João M. S. Silva 2016-05-30 20:55:53 UTC
I'm still trying to build a micro-SD card for either A10- or A20-OLinuXino-LIME. Meanwhile, Robert Nelson said:

"Bug 100241 is in a reference to a subsystem pci intel audio, which we don't have enabled.

"NMI watchdog is a generic error, we'd have to see the full dmesg to see what actually happened."

Don't know if this helps.
Comment 12 [account disabled by administrator] 2016-05-30 22:12:02 UTC
That's correct. I was going off the trace you gave me, as it contains the function trace of the kernel lockup. However, a full dmesg can show other warnings during boot that lead up to the final lockup you first posted, so a complete dmesg may help.
