Bug 196611 - NULL pointer dereference with cachestat
Summary: NULL pointer dereference with cachestat
Status: RESOLVED CODE_FIX
Alias: None
Product: Tracing/Profiling
Classification: Unclassified
Component: Ftrace (show other bugs)
Hardware: x86-64 Linux
: P1 normal
Assignee: Steven Rostedt
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-08-08 04:59 UTC by lilydjwg
Modified: 2017-11-10 15:41 UTC (History)
0 users

See Also:
Kernel Version: 4.12.4
Subsystem:
Regression: No
Bisected commit-id:


Attachments
Check for null ret_stack on profile function graph entry (489 bytes, patch)
2017-08-17 17:34 UTC, Steven Rostedt
Details | Diff

Description lilydjwg 2017-08-08 04:59:24 UTC
When I run the cachestat script from perf-tools (https://github.com/brendangregg/perf-tools/blob/master/fs/cachestat), my kernel oops. Here's a piece of the log:

Aug  7 00:06:45 host kernel: BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
Aug  7 00:06:45 host kernel: IP: profile_graph_entry+0x70/0x160
Aug  7 00:06:45 host kernel: PGD 1bd7ca067 
Aug  7 00:06:45 host kernel: P4D 1bd7ca067 
Aug  7 00:06:45 host kernel: PUD 1c835e067 
Aug  7 00:06:45 host kernel: PMD 0 
Aug  7 00:06:45 host kernel: 
Aug  7 00:06:45 host kernel: Oops: 0002 [#1] PREEMPT SMP
Aug  7 00:06:45 host kernel: Modules linked in: fuse bbswitch(O) nvidia(PO) ctr ccm xt_REDIRECT nf_nat_redirect xt_comment xt_owner xt_hl xt_u32 xt_tcpudp iptable_filter ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack bridge stp llc nls_iso8859_1 nls_cp437 vfat fat uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_core videodev media btusb btrtl arc4 intel_rapl snd_hda_codec_hdmi joydev mousedev iwlmvm snd_hda_codec_realtek iTCO_wdt iTCO_vendor_support mac80211 snd_hda_codec_generic x86_pkg_temp_thermal intel_powerclamp coretemp snd_hda_intel snd_hda_codec kvm_intel snd_hda_core thinkpad_acpi snd_hwdep snd_pcm snd_timer iwlwifi cfg80211 rtsx_pci_ms memstick input_leds tpm_crb e1000e kvm irqbypass ptp intel_cstate intel_rapl_perf
Aug  7 00:06:45 host kernel:  pps_core psmouse hci_uart btbcm btqca pcspkr btintel mei_me bluetooth i2c_i801 mei nvram ecdh_generic snd intel_pch_thermal soundcore rfkill shpchp crc16 led_class wmi thermal battery tpm_tis ac tpm_tis_core i2c_hid tpm evdev intel_lpss_acpi intel_lpss acpi_pad mac_hid sch_fq_codel pci_stub vboxpci(O) overlay vboxnetadp(O) vboxnetflt(O) vboxdrv(O) loop ip_tables x_tables xfs libcrc32c crc32c_generic algif_skcipher af_alg dm_crypt dm_mod dax uas usb_storage hid_generic usbhid hid sd_mod crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel pcbc rtsx_pci_sdmmc serio_raw mmc_core atkbd libps2 aesni_intel ahci libahci aes_x86_64 crypto_simd xhci_pci glue_helper cryptd libata xhci_hcd rtsx_pci usbcore scsi_mod usb_common i8042 serio i915 video button intel_gtt i2c_algo_bit drm_kms_helper
Aug  7 00:06:45 host kernel:  syscopyarea sysfillrect sysimgblt fb_sys_fops drm
Aug  7 00:06:45 host kernel: CPU: 2 PID: 12258 Comm: sh Tainted: P           O    4.12.4-1-lily #10
Aug  7 00:06:45 host kernel: Hardware name: LENOVO 20J6A01ACD/20J6A01ACD, BIOS R0FET31W (1.11 ) 03/10/2017
Aug  7 00:06:45 host kernel: task: ffff8801e74cbb80 task.stack: ffffc90009050000
Aug  7 00:06:45 host kernel: RIP: 0010:profile_graph_entry+0x70/0x160
Aug  7 00:06:45 host kernel: RSP: 0018:ffffc90009053c30 EFLAGS: 00010297
Aug  7 00:06:45 host kernel: RAX: 0000000000000000 RBX: ffffc90009053d60 RCX: 0000000000000000
Aug  7 00:06:45 host kernel: RDX: ffff8801e74cbb80 RSI: ffffc90009053d60 RDI: ffffc90009053c44
Aug  7 00:06:45 host kernel: RBP: ffffc90009053c30 R08: ffffea0008fdc11c R09: ffffea0008fdc100
Aug  7 00:06:45 host kernel: R10: 0000000000000000 R11: ffffc90009053ce0 R12: ffffffff811aa6d0
Aug  7 00:06:45 host kernel: R13: ffffffff811998b9 R14: 0000000000000000 R15: ffffea0008fdc100
Aug  7 00:06:45 host kernel: FS:  00007fe87ed91b40(0000) GS:ffff880252480000(0000) knlGS:0000000000000000
Aug  7 00:06:45 host kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug  7 00:06:45 host kernel: CR2: 0000000000000018 CR3: 00000001bde96000 CR4: 00000000003406e0
Aug  7 00:06:45 host kernel: Call Trace:
Aug  7 00:06:45 host kernel:  prepare_ftrace_return+0x78/0xd0
Aug  7 00:06:45 host kernel:  ftrace_graph_caller+0x78/0xa8
Aug  7 00:06:45 host kernel:  ? mark_page_accessed+0x5/0x170
Aug  7 00:06:45 host kernel:  ? pagecache_get_page+0x2c/0x1f0
Aug  7 00:06:45 host kernel:  mark_page_accessed+0x5/0x170
Aug  7 00:06:45 host kernel:  generic_file_read_iter+0x2f9/0x700
Aug  7 00:06:46 host kernel:  ? mark_page_accessed+0x5/0x170
Aug  7 00:06:46 host kernel:  ? ftrace_graph_caller+0xa8/0xa8
Aug  7 00:06:46 host kernel:  ? pagevec_lru_move_fn+0xcb/0xf0
Aug  7 00:06:46 host kernel:  xfs_file_buffered_aio_read+0x47/0xe0 [xfs]
Aug  7 00:06:46 host kernel:  xfs_file_read_iter+0x62/0xb0 [xfs]
Aug  7 00:06:46 host kernel:  __vfs_read+0xe1/0x130
Aug  7 00:06:46 host kernel:  vfs_read+0x96/0x130
Aug  7 00:06:46 host kernel:  SyS_read+0x55/0xc0
Aug  7 00:06:46 host kernel:  entry_SYSCALL_64_fastpath+0x1e/0xa9
Aug  7 00:06:46 host kernel: RIP: 0033:0x7fe87e42af4c
Aug  7 00:06:46 host kernel: RSP: 002b:00007fff56c2f958 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
Aug  7 00:06:46 host kernel: RAX: ffffffffffffffda RBX: 000000000000000e RCX: 00007fe87e42af4c
Aug  7 00:06:46 host kernel: RDX: 0000000000001000 RSI: 00000000018f3bb0 RDI: 0000000000000004
Aug  7 00:06:46 host kernel: RBP: 0000000000002710 R08: 0000000000000000 R09: 0000000000000010
Aug  7 00:06:46 host kernel: R10: 00000000018f4b8c R11: 0000000000000246 R12: 00007fe87e756ad8
Aug  7 00:06:46 host kernel: R13: 0000000000000070 R14: 0000000001909fd0 R15: 00007fe87e756a80
Aug  7 00:06:46 host kernel: Code: bd 00 45 85 c9 75 33 57 9d 0f 1f 44 00 00 83 f9 31 77 20 48 8d 04 89 65 48 8b 14 25 80 d3 00 00 48 8b 8a 88 18 00 00 48 8d 04 c1 <48> c7 40 18 00 00 00 00 b8 01 00 00 00 5d c3 48 b8 eb 83 b5 80 
Aug  7 00:06:46 host kernel: RIP: profile_graph_entry+0x70/0x160 RSP: ffffc90009053c30
Aug  7 00:06:46 host kernel: CR2: 0000000000000018
Aug  7 00:06:46 host kernel: ---[ end trace c59c5161a34998c3 ]---
Aug  7 00:06:46 host kernel: BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
Aug  7 00:06:46 host kernel: IP: profile_graph_entry+0x70/0x160
Aug  7 00:06:46 host kernel: PGD 1bd7ca067 
Aug  7 00:06:46 host kernel: P4D 1bd7ca067 
Aug  7 00:06:46 host kernel: PUD 1c835e067 
Aug  7 00:06:46 host kernel: PMD 0 
Aug  7 00:06:46 host kernel: 
Aug  7 00:06:46 host kernel: Oops: 0002 [#2] PREEMPT SMP
Aug  7 00:06:46 host kernel: Modules linked in: fuse bbswitch(O) nvidia(PO) ctr ccm xt_REDIRECT nf_nat_redirect xt_comment xt_owner xt_hl xt_u32 xt_tcpudp iptable_filter ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack bridge stp llc nls_iso8859_1 nls_cp437 vfat fat uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_core videodev media btusb btrtl arc4 intel_rapl snd_hda_codec_hdmi joydev mousedev iwlmvm snd_hda_codec_realtek iTCO_wdt iTCO_vendor_support mac80211 snd_hda_codec_generic x86_pkg_temp_thermal intel_powerclamp coretemp snd_hda_intel snd_hda_codec kvm_intel snd_hda_core thinkpad_acpi snd_hwdep snd_pcm snd_timer iwlwifi cfg80211 rtsx_pci_ms memstick input_leds tpm_crb e1000e kvm irqbypass ptp intel_cstate intel_rapl_perf
Aug  7 00:06:46 host kernel:  pps_core psmouse hci_uart btbcm btqca pcspkr btintel mei_me bluetooth i2c_i801 mei nvram ecdh_generic snd intel_pch_thermal soundcore rfkill shpchp crc16 led_class wmi thermal battery tpm_tis ac tpm_tis_core i2c_hid tpm evdev intel_lpss_acpi intel_lpss acpi_pad mac_hid sch_fq_codel pci_stub vboxpci(O) overlay vboxnetadp(O) vboxnetflt(O) vboxdrv(O) loop ip_tables x_tables xfs libcrc32c crc32c_generic algif_skcipher af_alg dm_crypt dm_mod dax uas usb_storage hid_generic usbhid hid sd_mod crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel pcbc rtsx_pci_sdmmc serio_raw mmc_core atkbd libps2 aesni_intel ahci libahci aes_x86_64 crypto_simd xhci_pci glue_helper cryptd libata xhci_hcd rtsx_pci usbcore scsi_mod usb_common i8042 serio i915 video button intel_gtt i2c_algo_bit drm_kms_helper
Aug  7 00:06:46 host kernel:  syscopyarea sysfillrect sysimgblt fb_sys_fops drm
Aug  7 00:06:46 host kernel: CPU: 2 PID: 12258 Comm: sh Tainted: P      D    O    4.12.4-1-lily #10
Aug  7 00:06:46 host kernel: Hardware name: LENOVO 20J6A01ACD/20J6A01ACD, BIOS R0FET31W (1.11 ) 03/10/2017
Aug  7 00:06:46 host kernel: task: ffff8801e74cbb80 task.stack: ffffc90009050000
Aug  7 00:06:46 host kernel: RIP: 0010:profile_graph_entry+0x70/0x160
Aug  7 00:06:46 host kernel: RSP: 0018:ffffc90009053b80 EFLAGS: 00010297
Aug  7 00:06:46 host kernel: RAX: 0000000000000000 RBX: ffffc90009053cb0 RCX: 0000000000000000
Aug  7 00:06:46 host kernel: RDX: ffff8801e74cbb80 RSI: ffffc90009053cb0 RDI: ffffc90009053b94
Aug  7 00:06:46 host kernel: RBP: ffffc90009053b80 R08: ffff8802070d6c80 R09: 00000000004c6000
Aug  7 00:06:46 host kernel: R10: 0000000000000000 R11: ffffffff81ea85ad R12: ffffffff811aa6d0
Aug  7 00:06:46 host kernel: R13: ffffffff811cff2e R14: 0000000000000000 R15: ffffc90009053e10
Aug  7 00:06:46 host kernel: FS:  0000000000000000(0000) GS:ffff880252480000(0000) knlGS:0000000000000000
Aug  7 00:06:46 host kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug  7 00:06:46 host kernel: CR2: 0000000000000018 CR3: 00000001bde96000 CR4: 00000000003406e0
Aug  7 00:06:46 host kernel: Call Trace:
Aug  7 00:06:46 host kernel:  prepare_ftrace_return+0x78/0xd0
Aug  7 00:06:46 host kernel:  ftrace_graph_caller+0x78/0xa8
Aug  7 00:06:46 host kernel:  ? profile_graph_entry+0x70/0x160
Aug  7 00:06:46 host kernel:  ? mark_page_accessed+0x5/0x170
Aug  7 00:06:46 host kernel:  mark_page_accessed+0x5/0x170
Aug  7 00:06:46 host kernel:  unmap_page_range+0x58e/0x990
Aug  7 00:06:46 host kernel:  ? mark_page_accessed+0x5/0x170
Aug  7 00:06:46 host kernel:  ? ftrace_graph_caller+0xa8/0xa8
Aug  7 00:06:46 host kernel:  ? set_next_entity+0x162/0x630
Aug  7 00:06:46 host kernel:  unmap_single_vma+0x7d/0xf0
Aug  7 00:06:46 host kernel:  unmap_vmas+0x51/0xb0
Aug  7 00:06:46 host kernel:  exit_mmap+0xab/0x190
Aug  7 00:06:46 host kernel:  mmput+0x3c/0xf0
Aug  7 00:06:46 host kernel:  do_exit+0x592/0xb40
Aug  7 00:06:46 host kernel:  ? SyS_read+0x55/0xc0
Aug  7 00:06:46 host kernel:  rewind_stack_do_exit+0x17/0x20
Aug  7 00:06:46 host kernel: RIP: 0033:0x7fe87e42af4c
Aug  7 00:06:46 host kernel: RSP: 002b:00007fff56c2f958 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
Aug  7 00:06:46 host kernel: RAX: ffffffffffffffda RBX: 000000000000000e RCX: 00007fe87e42af4c
Aug  7 00:06:46 host kernel: RDX: 0000000000001000 RSI: 00000000018f3bb0 RDI: 0000000000000004
Aug  7 00:06:46 host kernel: RBP: 0000000000002710 R08: 0000000000000000 R09: 0000000000000010
Aug  7 00:06:46 host kernel: R10: 00000000018f4b8c R11: 0000000000000246 R12: 00007fe87e756ad8
Aug  7 00:06:46 host kernel: R13: 0000000000000070 R14: 0000000001909fd0 R15: 00007fe87e756a80
Aug  7 00:06:46 host kernel: Code: bd 00 45 85 c9 75 33 57 9d 0f 1f 44 00 00 83 f9 31 77 20 48 8d 04 89 65 48 8b 14 25 80 d3 00 00 48 8b 8a 88 18 00 00 48 8d 04 c1 <48> c7 40 18 00 00 00 00 b8 01 00 00 00 5d c3 48 b8 eb 83 b5 80 
Aug  7 00:06:46 host kernel: RIP: profile_graph_entry+0x70/0x160 RSP: ffffc90009053b80
Aug  7 00:06:46 host kernel: CR2: 0000000000000018
Aug  7 00:06:46 host kernel: ---[ end trace c59c5161a34998c4 ]---
Aug  7 00:06:46 host kernel: Fixing recursive fault but reboot is needed!
Aug  7 00:06:46 host kernel: BUG: scheduling while atomic: sh/12258/0x00000002
Aug  7 00:06:46 host kernel: Modules linked in: fuse bbswitch(O) nvidia(PO) ctr ccm xt_REDIRECT nf_nat_redirect xt_comment xt_owner xt_hl xt_u32 xt_tcpudp iptable_filter ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack bridge stp llc nls_iso8859_1 nls_cp437 vfat fat uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_core videodev media btusb btrtl arc4 intel_rapl snd_hda_codec_hdmi joydev mousedev iwlmvm snd_hda_codec_realtek iTCO_wdt iTCO_vendor_support mac80211 snd_hda_codec_generic x86_pkg_temp_thermal intel_powerclamp coretemp snd_hda_intel snd_hda_codec kvm_intel snd_hda_core thinkpad_acpi snd_hwdep snd_pcm snd_timer iwlwifi cfg80211 rtsx_pci_ms memstick input_leds tpm_crb e1000e kvm irqbypass ptp intel_cstate intel_rapl_perf
Aug  7 00:06:46 host kernel:  pps_core psmouse hci_uart btbcm btqca pcspkr btintel mei_me bluetooth i2c_i801 mei nvram ecdh_generic snd intel_pch_thermal soundcore rfkill shpchp crc16 led_class wmi thermal battery tpm_tis ac tpm_tis_core i2c_hid tpm evdev intel_lpss_acpi intel_lpss acpi_pad mac_hid sch_fq_codel pci_stub vboxpci(O) overlay vboxnetadp(O) vboxnetflt(O) vboxdrv(O) loop ip_tables x_tables xfs libcrc32c crc32c_generic algif_skcipher af_alg dm_crypt dm_mod dax uas usb_storage hid_generic usbhid hid sd_mod crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel pcbc rtsx_pci_sdmmc serio_raw mmc_core atkbd libps2 aesni_intel ahci libahci aes_x86_64 crypto_simd xhci_pci glue_helper cryptd libata xhci_hcd rtsx_pci usbcore scsi_mod usb_common i8042 serio i915 video button intel_gtt i2c_algo_bit drm_kms_helper
Aug  7 00:06:46 host kernel:  syscopyarea sysfillrect sysimgblt fb_sys_fops drm
Aug  7 00:06:46 host kernel: CPU: 2 PID: 12258 Comm: sh Tainted: P      D    O    4.12.4-1-lily #10
Aug  7 00:06:46 host kernel: Hardware name: LENOVO 20J6A01ACD/20J6A01ACD, BIOS R0FET31W (1.11 ) 03/10/2017
Aug  7 00:06:46 host kernel: Call Trace:
Aug  7 00:06:46 host kernel:  dump_stack+0x63/0x8d
Aug  7 00:06:46 host kernel:  __schedule_bug+0x55/0x70
Aug  7 00:06:46 host kernel:  __schedule+0x6e4/0x860
Aug  7 00:06:46 host kernel:  ? vprintk_func+0x20/0x50
Aug  7 00:06:46 host kernel:  schedule+0x3d/0x90
Aug  7 00:06:46 host kernel:  do_exit+0x928/0xb40
Aug  7 00:06:46 host kernel:  ? SyS_read+0x55/0xc0
Aug  7 00:06:46 host kernel:  rewind_stack_do_exit+0x17/0x20
Aug  7 00:06:46 host kernel: RIP: 0033:0x7fe87e42af4c
Aug  7 00:06:46 host kernel: RSP: 002b:00007fff56c2f958 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
Aug  7 00:06:46 host kernel: RAX: ffffffffffffffda RBX: 000000000000000e RCX: 00007fe87e42af4c
Aug  7 00:06:46 host kernel: RDX: 0000000000001000 RSI: 00000000018f3bb0 RDI: 0000000000000004
Aug  7 00:06:46 host kernel: RBP: 0000000000002710 R08: 0000000000000000 R09: 0000000000000010
Aug  7 00:06:46 host kernel: R10: 00000000018f4b8c R11: 0000000000000246 R12: 00007fe87e756ad8
Aug  7 00:06:46 host kernel: R13: 0000000000000070 R14: 0000000001909fd0 R15: 00007fe87e756a80

This is a lightly customized kernel based on the one from Arch Linux, with some options turned on and the uksm patch.

After I saw messages like the above, I stopped the script. Then I found I had a process stuck in D state.

This also happened with my 4.9.33 kernel. Kernel oops, then some processes entered D state forever, needing a forceful reset.

I have all the debug symbols available. The kernel can be downloaded from here:
https://repo.archlinuxcn.org/x86_64/linux-lily{,-debug,-headers}-4.12.4-1-x86_64.pkg.tar.xz{,.sig}

If you need more information, please tell me how to get it.
Comment 1 Steven Rostedt 2017-08-08 05:08:13 UTC
I've been informed that a Russian operative has infiltrated our boarders.
They are traveling to Las Vegas and plan on gambling away the secret Russian
tapes of Trump and his Russian escapades. I need to become the Dark Knight
and track them down before they have the ability to destroy the reputation
of the United States Presidency. I have been given a large sum of money, and
will be playing high stakes poker to make sure the tapes do not get into the
wrong hands.

This mission will last till Aug 13th and since I'll be under a secret
identity, I will not be able to reply back till the 14th.

Until then, same Bat Time, same Bat channel!

-- Steve
Comment 2 Steven Rostedt 2017-08-17 15:41:22 UTC
You have several out of tree modules (those labeled with '(O)') and a proprietary module (those labeled with '(P)', basically nvidia). Can you reproduce this bug without those loaded?

If so, please post another report of the bug without the tainted modules.

Thanks,

-- Steve
Comment 3 Steven Rostedt 2017-08-17 16:35:02 UTC
Nevermind. I just triggered the bug. I'll take a look.

-- Steve
Comment 4 Steven Rostedt 2017-08-17 17:34:48 UTC
Created attachment 257999 [details]
Check for null ret_stack on profile function graph entry

There's a small race when function graph shutsdown and the calling of the registered function graph entry callback. The callback must not reference the task's ret_stack without first checking that it is not NULL. Note, when a ret_stack is allocated for a task, it stays allocated until the task exits. The problem here, is that function_graph is shutdown, and a new task was created, which doesn't have its ret_stack allocated. But since some of the functions are still being traced, the callbacks can still be called.

The normal function_graph code handles this, but starting with commit 8861dd303c ("ftrace: Access ret_stack->subtime only in the function profiler") the profiler code references the ret_stack on function entry, but doesn't check if it is NULL first.
Comment 5 Steven Rostedt 2017-11-10 15:41:49 UTC
The fix is in mainline commit a8f0f9e49956a ("ftrace: Check for null ret_stack on profile function graph entry function")

Note You need to log in before you can comment on or make changes to this bug.