+++ This bug was initially created as a clone of Bug #13811 +++

I couldn't re-open the original bug (because I didn't report it), so I'm cloning it. That may even be better, because _something_ was fixed in that other bug, so this may be a different bug with exactly the same symptoms. Anyway...

I've got a ThinkPad X61s (i586 system) with Intel GM965 chipset. Running kernel 2.6.37.6 and xorg-x11-driver-video-7.6-53.58.1, KMS. After resuming from suspend-to-disk (using uswsusp), I sometimes get random memory corruption. I have already verified that:

1. It is highly unlikely to be a hardware failure. The corruption happens at various physical locations, but always follows the same pattern (all zeroes or a series of 0x00aaaaaa in the first 16 bytes of a page).
2. It only happens after suspend-to-disk, not after suspend-to-RAM.
3. The hibernation image contains correct data (without the corruption). I verified that by booting a different OS between hibernate and resume and examining the swap partition.
4. Corruptions may occur both in the page cache and in anonymous mappings. I verified that by examining the state of the system after resume using the crash utility.

A list of corrupted pages:

PHYSICAL  INDEX  FILE
7a31c000  15     /usr/lib/libpolkit-gobject-1.so.0.0.0
7ab1a000  41     /usr/sbin/hald
7ab1b000  42     /usr/sbin/hald
7ab1c000  379    /usr/lib/libkdeui.so.5.6.0
7ab1f000  3c     /usr/sbin/hald
7ab34000  a3f    /usr/lib/xulrunner-2.0.1/libxul.so
7ab38000  1      /usr/lib/libaudiofile.so.0.0.2
7ab3c000  8186   anonymous (mapped by PID 23762 and PID 23786)
7ab3d000  8092   same anonymous mapping

PID 23762 (konsole) was killed with SIGSEGV when trying to dereference an invalid pointer taken from 0x08186008 (inside the corrupted page).
The corrupted pages are not in the vicinity of an MMIO region:

00000000-0000ffff : reserved
00010000-000997ff : System RAM
00099800-0009ffff : reserved
000a0000-000bffff : Video RAM area
000c0000-000cffff : Video ROM
000d0000-000d0fff : Adapter ROM
000d1000-000d1fff : Adapter ROM
000d6000-000d7fff : reserved
000e0000-000fffff : reserved
  000f0000-000fffff : System ROM
00100000-7d6affff : System RAM
  00200000-00603712 : Kernel code
  00603713-00911707 : Kernel data
  00984000-00a7ad97 : Kernel bss
7d6b0000-7d6cbfff : ACPI Tables
7d6cc000-7d6fffff : ACPI Non-volatile Storage
7d700000-7dffffff : reserved
7e000000-7fffffff : RAM buffer
80000000-83ffffff : PCI CardBus 0000:06
84000000-84000fff : Intel Flush Page
dc100000-dfcfffff : PCI Bus 0000:02
  dfcffc00-dfcfffff : 0000:02:00.0
dfe00000-dfefffff : PCI Bus 0000:02
  dfe00000-dfe0ffff : 0000:02:00.0
e0000000-efffffff : 0000:00:02.0
f0000000-f3ffffff : PCI MMCONFIG 0000 [bus 00-3f]
  f0000000-f3ffffff : reserved
    f0000000-f3ffffff : pnp 00:02
f4000000-f7ffffff : PCI Bus 0000:05
  f4000000-f7ffffff : PCI CardBus 0000:06

FWIW PCI device 0000:00:02.0 is the Intel graphics card:

00:02.0 VGA compatible controller: Intel Corporation Mobile GM965/GL960 Integrated Graphics Controller (primary) (rev 0c) (prog-if 00 [VGA controller])
        Subsystem: Lenovo T61
        Flags: bus master, fast devsel, latency 0, IRQ 43
        Memory at f8100000 (64-bit, non-prefetchable) [size=1M]
        Memory at e0000000 (64-bit, prefetchable) [size=256M]
        I/O ports at 1800 [size=8]
        Expansion ROM at <unassigned> [disabled]
        Capabilities: [90] MSI: Enable+ Count=1/1 Maskable- 64bit-
        Capabilities: [d0] Power Management version 3
        Kernel driver in use: i915
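For what it's worth, the "not near MMIO" claim can be double-checked mechanically. A minimal Python sketch, assuming the ranges and page addresses are copied correctly from the /proc/iomem dump and the crash listing in this comment; the names SYSTEM_RAM, GFX_APERTURE and classify are made up for illustration:

```python
# Ranges copied from the /proc/iomem dump in this report (inclusive bounds).
SYSTEM_RAM = [(0x00010000, 0x000997FF), (0x00100000, 0x7D6AFFFF)]
GFX_APERTURE = (0xE0000000, 0xEFFFFFFF)  # PCI 0000:00:02.0, prefetchable BAR

# Physical addresses of the corrupted pages listed in this report.
CORRUPTED = [0x7A31C000, 0x7AB1A000, 0x7AB1B000, 0x7AB1C000, 0x7AB1F000,
             0x7AB34000, 0x7AB38000, 0x7AB3C000, 0x7AB3D000]
PAGE_SIZE = 4096


def in_range(addr, bounds):
    """True if addr lies inside the inclusive (low, high) range."""
    low, high = bounds
    return low <= addr <= high


def classify(addr):
    """Name the (hypothetical) region class a physical address falls into."""
    if any(in_range(addr, r) for r in SYSTEM_RAM):
        return "System RAM"
    if in_range(addr, GFX_APERTURE):
        return "graphics aperture"
    return "other"


for addr in CORRUPTED:
    # Distance to the end of the large System RAM range, in MiB.
    gap_mib = (SYSTEM_RAM[1][1] - addr) >> 20
    print(f"{addr:08x}  {classify(addr):<12}  {gap_mib} MiB below end of RAM")
```

This only shows that the listed pages are ordinary, page-aligned RAM; it says nothing about who wrote to them.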
How much memory do you have installed on that machine?
It can be seen from the /proc/iomem dump, but there are 2 GB of RAM in this machine. Two 1GB SODIMM modules using DDR2 on a 667 MHz synchronous bus, if you want me to be very specific...
*** Bug 13811 has been marked as a duplicate of this bug. ***
I'd like to verify whether or not highmem is related to the breakage, but I'm not sure how to do that. Perhaps you can simply build a mainline kernel without highmem support and check if you can trigger the problem with that?
Good idea, Rafael, and I'm going to try it out. However, I noticed another thing. The memory corruption always happens in roughly the same region. I looked at what is mapped there after a fresh boot, and it seems there's a 3M shm chunk allocated at phys 7a935000 (or near). A closer look reveals that this is in fact graphics data and corresponds to the 32bpp background image (the laptop has a 1024x768 LCD, which takes 3M). All pages look similar to this:

PAGE      PHYSICAL  MAPPING   INDEX  CNT  FLAGS
f56326a0  7a935000  f5f70b00  0      2    8008002c

However, the associated shm mapping is already deleted:

struct address_space { // @ 0xf5f70b00
  host = 0xf5f70a38,
  ...
}
struct inode { // @ 0xf5f70a38
  i_mode = 0100777,
  i_uid = 0,
  i_gid = 0,
  i_op = 0xc05c7a80, // shmem_inode_operations
  ...
  i_ino = 10068,
  ...
  i_nlink = 0,
  ...
}

"fuser 0xf5f70a38" doesn't give any result either. However, the refcount for the pages is still 2.

Now, the interesting part is that after a resume from disk, the refcount was 1 (temporarily). It changed to 2 after a while, which looks a bit strange to me. Maybe the driver drops a reference to some pages before suspending, but can't cope with the situation if somebody else allocates the page for themselves after a resume.

Anyway, if the above is true, then the same problem should also exist without highmem. Stay tuned.
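A quick sanity check of the numbers in this comment, as a Python sketch. It assumes 4 bytes per pixel for 32bpp and, purely for illustration, treats the chunk as physically contiguous starting at 7a935000 with the same layout as in the boot from the original report. Neither is guaranteed (shm pages need not be contiguous, and "or near" is all that is known), so take the overlap count as a hint only:

```python
PAGE_SIZE = 4096

# 1024x768 at 32 bits per pixel (4 bytes per pixel assumed).
width, height, bytes_per_pixel = 1024, 768, 4
fb_bytes = width * height * bytes_per_pixel
fb_pages = fb_bytes // PAGE_SIZE

# Hypothetical contiguous placement of the chunk, for illustration only.
chunk_start = 0x7A935000
chunk_end = chunk_start + fb_bytes  # exclusive

# Corrupted pages from the original report (a different boot).
corrupted = [0x7A31C000, 0x7AB1A000, 0x7AB1B000, 0x7AB1C000, 0x7AB1F000,
             0x7AB34000, 0x7AB38000, 0x7AB3C000, 0x7AB3D000]
inside = [a for a in corrupted if chunk_start <= a < chunk_end]

print(f"{fb_bytes} bytes = {fb_bytes >> 20} MiB = {fb_pages} pages")
print(f"assumed chunk: {chunk_start:08x}-{chunk_end - 1:08x}")
print(f"{len(inside)} of {len(corrupted)} reported pages fall inside it")
```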
First update - I was able to reproduce the corruption with 3.0.0-rc2 with CONFIG_HIGHMEM4G. I will now recompile with CONFIG_HIGHMEM=n and see what happens. Stay tuned.
I'm sorry for the delay. I had some issues compiling and booting a non-highmem kernel. However, the problem doesn't appear to be with highmem at all. I've just experienced this nice Oops on resuming:

[ 8211.006006] BUG: unable to handle kernel paging request at 00aaaaae
[ 8211.009185] IP: [<c02f7c17>] free_block+0x87/0x110
[ 8211.009185] *pde = 00000000
[ 8211.009185] Oops: 0002 [#1] SMP
[ 8211.009185] Modules linked in: fuse usbhid snd_usb_audio hid snd_usbmidi_lib snd_rawmidi ip6t_LOG xt_tcpudp xt_pkttype ipt_LOG xt_limit af_packet ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_raw xt_NOTRACK ipt_REJECT iptable_raw iptable_filter ip6table_mangle nf_conntrack_netbios_ns nf_conntrack_broadcast nf_conntrack_ipv4 nf_defrag_ipv4 ip_tables xt_conntrack nf_conntrack ip6table_filter ip6_tables x_tables cpufreq_conservative cpufreq_userspace snd_pcm_oss cpufreq_powersave snd_mixer_oss acpi_cpufreq mperf snd_seq snd_seq_device sha256_generic cbc edd dm_crypt loop dm_mod arc4 snd_hda_codec_analog iwl4965 iwl_legacy mac80211 snd_hda_intel snd_hda_codec snd_hwdep snd_pcm pcmcia sg cfg80211 firewire_ohci snd_timer e1000e firewire_core yenta_socket thinkpad_acpi pcmcia_rsrc pcmcia_core i2c_i801 battery crc_itu_t snd_page_alloc snd sdhci_pci rfkill sdhci pcspkr mmc_core ac soundcore uhci_hcd i915 drm_kms_helper rtc_cmos ehci_hcd drm i2c_algo_bit i2c_core usbcore video button fan processor ata_generic ata_piix ahci libahci libata thermal thermal_sys hwmon
[ 8211.009185]
[ 8211.009185] Pid: 11278, comm: kworker/0:3 Not tainted 3.0.0-rc3-pt-low #29 LENOVO 76693KG/76693KG
[ 8211.009185] EIP: 0060:[<c02f7c17>] EFLAGS: 00010002 CPU: 0
[ 8211.009185] EIP is at free_block+0x87/0x110
[ 8211.009185] EAX: f1d6c7b0 EBX: f5414a00 ECX: f1d6c000 EDX: 00aaaaaa
[ 8211.009185] ESI: 00000001 EDI: 00aaaaaa EBP: f478bf0c ESP: f478bef4
[ 8211.009185] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
[ 8211.009185] Process kworker/0:3 (pid: 11278, ti=f478a000 task=f47912b0 task.ti=f478a000)
[ 8211.009185] Stack:
[ 8211.009185] f5407994 0000000b f5439e40 f5407980 f5414a00 f5439e40 f478bf2c c02f7dcd
[ 8211.009185] 00000000 f5407994 0000000b f5439e40 f5414a00 f6792a00 f478bf4c c02f7f65
[ 8211.009185] 00000000 00000000 f7402de0 f7402de0 00000000 f6792a00 f478bf84 c0257f2c
[ 8211.009185] Call Trace:
[ 8211.009185] [<c02f7dcd>] drain_array+0x7d/0xc0
[ 8211.009185] [<c02f7f65>] cache_reap+0x75/0x100
[ 8211.009185] [<c0257f2c>] process_one_work+0xfc/0x370
[ 8211.009185] [<c0256ea0>] ? do_work_for_cpu+0x20/0x20
[ 8211.009185] [<c02f7ef0>] ? drain_freelist+0x90/0x90
[ 8211.009185] [<c02589dc>] worker_thread+0x12c/0x2c0
[ 8211.009185] [<c02588b0>] ? manage_workers+0x110/0x110
[ 8211.009185] [<c025d3d4>] kthread+0x74/0x80
[ 8211.009185] [<c025d360>] ? kthread_worker_fn+0x140/0x140
[ 8211.009185] [<c05f2636>] kernel_thread_helper+0x6/0xd
[ 8211.009185] Code: 8b 0a 80 e5 80 0f 85 95 00 00 00 8b 0a 81 e1 80 00 00 00 0f 84 8f 00 00 00 8b 4a 1c 8b 7d 08 8b 55 f0 8b 5c ba 50 8b 39 8b 51 04
[ 8211.009185] 57 04 89 3a c7 01 00 01 10 00 c7 41 04 00 02 20 00 2b 41 0c
[ 8211.009185] EIP: [<c02f7c17>] free_block+0x87/0x110 SS:ESP 0068:f478bef4
[ 8211.009185] CR2: 0000000000aaaaae
[ 8217.222977] ---[ end trace 1657ce4ee4b55864 ]---
[ 8217.330465] BUG: unable to handle kernel paging request at fffffffc
[ 8217.333695] IP: [<c025d5c9>] kthread_data+0x9/0x10
[ 8217.333695] *pde = 00838067 *pte = 00000000
[ 8217.333695] Oops: 0000 [#2] SMP
[ 8217.333695] Modules linked in: fuse usbhid snd_usb_audio hid snd_usbmidi_lib snd_rawmidi ip6t_LOG xt_tcpudp xt_pkttype ipt_LOG xt_limit af_packet ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_raw xt_NOTRACK ipt_REJECT iptable_raw iptable_filter ip6table_mangle nf_conntrack_netbios_ns nf_conntrack_broadcast nf_conntrack_ipv4 nf_defrag_ipv4 ip_tables xt_conntrack nf_conntrack ip6table_filter ip6_tables x_tables cpufreq_conservative cpufreq_userspace snd_pcm_oss cpufreq_powersave snd_mixer_oss acpi_cpufreq mperf snd_seq snd_seq_device sha256_generic cbc edd dm_crypt loop dm_mod arc4 snd_hda_codec_analog iwl4965 iwl_legacy mac80211 snd_hda_intel snd_hda_codec snd_hwdep snd_pcm pcmcia sg cfg80211 firewire_ohci snd_timer e1000e firewire_core yenta_socket thinkpad_acpi pcmcia_rsrc pcmcia_core i2c_i801 battery crc_itu_t snd_page_alloc snd sdhci_pci rfkill sdhci pcspkr mmc_core ac soundcore uhci_hcd i915 drm_kms_helper rtc_cmos ehci_hcd drm i2c_algo_bit i2c_core usbcore video button fan processor ata_generic ata_piix ahci libahci libata thermal thermal_sys hwmon
[ 8217.333695]
[ 8217.333695] Pid: 11278, comm: kworker/0:3 Tainted: G D 3.0.0-rc3-pt-low #29 LENOVO 76693KG/76693KG
[ 8217.333695] EIP: 0060:[<c025d5c9>] EFLAGS: 00010002 CPU: 0
[ 8217.333695] EIP is at kthread_data+0x9/0x10
[ 8217.333695] EAX: 00000000 EBX: 00000000 ECX: c082e840 EDX: 00000000
[ 8217.333695] ESI: 00000000 EDI: f47912b0 EBP: f478bcd0 ESP: f478bcd0
[ 8217.333695] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
[ 8217.333695] Process kworker/0:3 (pid: 11278, ti=f478a000 task=f47912b0 task.ti=f478a000)
[ 8217.333695] Stack:
[ 8217.333695] f478bce0 c0259c6c f7405840 00000000 f478bd78 c05e9be1 00000000 00000086
[ 8217.333695] c041a9a3 f6133808 f2360000 f236001c c082e840 c082e840 00000000 c07a2040
[ 8217.333695] c082e840 c082e840 f478bd30 e92cd2c8 c082e840 f47912b0 f478bd30 c02a249d
[ 8217.333695] Call Trace:
[ 8217.333695] [<c0259c6c>] wq_worker_sleeping+0xc/0x80
[ 8217.333695] [<c05e9be1>] schedule+0x531/0x6c0
[ 8217.333695] [<c041a9a3>] ? cfq_put_queue+0x63/0xe0
[ 8217.333695] [<c02a249d>] ? call_rcu_sched+0xd/0x10
[ 8217.333695] [<c04176a4>] ? cic_free_func+0x64/0x80
[ 8217.333695] [<c040c524>] ? put_io_context+0x34/0x50
[ 8217.333695] [<c040c524>] ? put_io_context+0x34/0x50
[ 8217.333695] [<c0244596>] do_exit+0x1d6/0x380
[ 8217.333695] [<c05ecb04>] oops_end+0x84/0xc0
[ 8217.333695] [<c022313e>] no_context+0xbe/0x150
[ 8217.333695] [<c042a40e>] ? sg_init_table+0x1e/0x40
[ 8217.333695] [<c0223258>] __bad_area_nosemaphore+0x88/0x130
[ 8217.333695] [<c05ee680>] ? spurious_fault+0xc0/0xc0
[ 8217.333695] [<c0223312>] bad_area_nosemaphore+0x12/0x20
[ 8217.333695] [<c05ee965>] do_page_fault+0x2e5/0x440
[ 8217.333695] [<f87cc869>] ? ahci_qc_prep+0x129/0x1c0 [libahci]
[ 8217.333695] [<c0207136>] ? nommu_map_sg+0x46/0xa0
[ 8217.333695] [<c041bd6f>] ? cfq_select_queue+0x1ff/0x300
[ 8217.333695] [<f8889da6>] ? ata_scsi_translate+0x76/0x140 [libata]
[ 8217.333695] [<c05ee680>] ? spurious_fault+0xc0/0xc0
[ 8217.333695] [<c05ec0e2>] error_code+0x5a/0x60
[ 8217.333695] [<c041007b>] ? diskstats_show+0x6b/0x3c0
[ 8217.333695] [<c04d007b>] ? bus_add_device+0x12b/0x1a0
[ 8217.333695] [<c05ee680>] ? spurious_fault+0xc0/0xc0
[ 8217.333695] [<c02f7c17>] ? free_block+0x87/0x110
[ 8217.333695] [<c02f7dcd>] drain_array+0x7d/0xc0
[ 8217.333695] [<c02f7f65>] cache_reap+0x75/0x100
[ 8217.333695] [<c0257f2c>] process_one_work+0xfc/0x370
[ 8217.333695] [<c0256ea0>] ? do_work_for_cpu+0x20/0x20
[ 8217.333695] [<c02f7ef0>] ? drain_freelist+0x90/0x90
[ 8217.333695] [<c02589dc>] worker_thread+0x12c/0x2c0
[ 8217.333695] [<c02588b0>] ? manage_workers+0x110/0x110
[ 8217.333695] [<c025d3d4>] kthread+0x74/0x80
[ 8217.333695] [<c025d360>] ? kthread_worker_fn+0x140/0x140
[ 8217.333695] [<c05f2636>] kernel_thread_helper+0x6/0xd
[ 8217.333695] Code: b9 14 af 5f c0 8b 45 e4 e8 a5 f2 fd ff ba e4 7e 5f c0 8b 45 e4 e8 28 f8 fd ff 8b 45 e4 c9 c3 8d 76 00 55 8b 80 54 02 00 00 89 e5 <8b> 40 fc c9 c3 66 90 55 31 c0 89 e5 c9 c3 89 f6 8d bc 27 00 00
[ 8217.333695] EIP: [<c025d5c9>] kthread_data+0x9/0x10 SS:ESP 0068:f478bcd0
[ 8217.333695] CR2: 00000000fffffffc
[ 8217.333695] ---[ end trace 1657ce4ee4b55865 ]---
[ 8217.333695] Fixing recursive fault but reboot is needed!

Note the characteristic addresses (fffffffc, 00aaaaae). They correspond to both usual corruption patterns associated with this bug (namely all zeroes and a series of 00aaaaaa).
What should I do next?
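To make the link between the two fault addresses and the two corruption patterns explicit, here is a small Python sketch of the address arithmetic. The -4 displacement for the second oops is visible in its Code: line (<8b> 40 fc, with EAX = 00000000). The +4 displacement for the first oops is an assumption inferred from EDI = 00aaaaaa and CR2 = 00aaaaae; the helper name effective_address is made up for illustration:

```python
MASK32 = 0xFFFFFFFF  # addresses wrap at 32 bits on this i586 system


def effective_address(base, displacement):
    """Base register plus signed displacement, wrapped to 32 bits."""
    return (base + displacement) & MASK32


# First oops: pointer loaded from a page full of 0x00aaaaaa (EDI in the dump),
# assumed to be dereferenced at offset +4.
first = effective_address(0x00AAAAAA, 4)

# Second oops: pointer loaded as zero (EAX in the dump), dereferenced at -4.
second = effective_address(0x00000000, -4)

print(f"{first:08x} {second:08x}")
```

Under those assumptions the results match the CR2 values reported in the two oopses.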
Hmm. In principle, hibernation should work without the graphics driver too, at least when it is started from a non-X console. Can you please verify if you can hibernate/resume with the graphics driver unloaded (from a text console) and, if so, whether or not the corruption is reproducible in that configuration?
Bear in mind the corruption looks suspiciously like pixel values.
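To illustrate the point, a minimal Python sketch that reads 0x00aaaaaa as a pixel. The XRGB8888 layout (unused byte, then red, green, blue) is an assumption based on the 32bpp mode mentioned earlier, and decode_xrgb8888 is a made-up helper name:

```python
import struct


def decode_xrgb8888(value):
    """Split a 32-bit XRGB8888 value into (red, green, blue)."""
    return (value >> 16) & 0xFF, (value >> 8) & 0xFF, value & 0xFF


pixel = 0x00AAAAAA
red, green, blue = decode_xrgb8888(pixel)

# Sixteen bytes, the size of the corrupted area at the start of a page,
# hold exactly four such pixels when stored little-endian.
head = struct.pack("<4I", pixel, pixel, pixel, pixel)

print((red, green, blue), head.hex())
```

All three channels come out equal (170), i.e. a mid gray, which also happens to be the light gray of the standard 16-colour VGA palette. Whether that is where the value actually comes from is not established by this.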
(In reply to comment #9)
> Bear in mind the corruption looks suspiciously like pixel values.

And it seems that only folks with Intel graphics are affected. Here is a stupid idea: given that graphics is using RAM here (i.e. no dedicated memory), is it possible that somehow the boundary gets crossed, which then causes graphics artefacts to be written into where kernel and processes are? In other words, are we sure that detection of what's where is working on every boot? And suspend doesn't suffer from these problems, so...
Created attachment 65202
Patch to try

@Petr: I wonder if this patch makes any difference?
I just tried it. Seems to work as advertised (did two hibernates and one suspend in between). There's only one nit: my Gnome3 session displays a password dialog on my external monitor, but that has not been woken up as part of the resume. No biggie though. Much better than random FS corruption.
Cool, thanks for the info.
Interesting... can you try just calling pci_enable_device and pci_set_master in the thaw path instead (first one, then both) and see if that's what makes the difference? If decode is disabled for some reason it would probably be the enable that fixes things...
Sorry for my late reply, but I'm afraid the bug is not solved by rjw's patch. On a repeated resume I got the following (with the patch applied):

BUG: unable to handle kernel NULL pointer dereference at 00000008
IP: [<c0436901>] strcmp+0x11/0x30
*pde = 00000000
Oops: 0000 [#1] PREEMPT SMP
Modules linked in: tun fuse ip6t_LOG xt_tcpudp xt_pkttype ipt_LOG xt_limit af_packet ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_raw xt_NOTRACK ipt_REJECT iptable_raw iptable_filter ip6table_mangle nf_conntrack_netbios_ns nf_conntrack_broadcast nf_conntrack_ipv4 nf_defrag_ipv4 ip_tables xt_conntrack nf_conntrack ip6table_filter ip6_tables x_tables cpufreq_conservative cpufreq_userspace cpufreq_powersave snd_pcm_oss snd_mixer_oss snd_seq acpi_cpufreq mperf sha256_generic cbc edd dm_crypt loop dm_mod arc4 iwl4965 iwl_legacy mac80211 snd_hda_codec_analog pcmcia snd_hda_intel snd_usb_audio snd_hda_codec snd_usbmidi_lib snd_hwdep snd_rawmidi snd_pcm firewire_ohci snd_seq_device yenta_socket pcmcia_rsrc thinkpad_acpi firewire_core snd_timer snd sdhci_pci sdhci cfg80211 mmc_core crc_itu_t sg pcspkr i2c_i801 pcmcia_core battery snd_page_alloc soundcore ac e1000e rfkill usbhid hid uhci_hcd i915 drm_kms_helper drm ehci_hcd usbcore rtc_cmos i2c_algo_bit i2c_core video essor ata_generic ata_piix ahci libahci libata thermal thermal_sys hwmon

Pid: 5381, comm: ip6tables-batch Not tainted 3.0.0-pt+ #32 LENOVO 76693KG/76693KG
EIP: 0060:[<c0436901>] EFLAGS: 00010292 CPU: 0
EIP is at strcmp+0x11/0x30
EAX: 00000008 EBX: 00000000 ECX: 00000000 EDX: fcc75e86
ESI: 00000008 EDI: fcc75e86 EBP: f1365d34 ESP: f1365d2c
DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
Process ip6tables-batch (pid: 5381, ti=f1364000 task=f13fb140 task.ti=f1364000)
Stack:
00000168 fcc75e86 f1365d64 f891ab7d 00000001 c02f5f65 00000163 f1e65700
0a365d58 fffffffe 00c7bfff 0000000a fcc75e86 00000000 f1365d88 f891ace0
c02f5fcb f1e65700 f2b447a0 00000003 fcc75e84 fcc75de0 fcc75e84 f1365dec
Call Trace:
[<f891ab7d>] xt_find_target+0x7d/0x1c0 [x_tables]
[<c02f5f65>] ? vmap_page_range_noflush+0x65/0xa0
[<f891ace0>] xt_request_find_target+0x20/0x6c [x_tables]
[<c02f5fcb>] ? map_vm_area+0x2b/0x40
[<f89258c9>] find_check_entry+0x189/0x240 [ip6_tables]
[<c02f6f9b>] ? __get_vm_area_node+0x10b/0x150
[<f8925b07>] translate_table+0x187/0x250 [ip6_tables]
[<f892654b>] do_replace+0x9b/0x100 [ip6_tables]
[<f8926605>] do_ip6t_set_ctl+0x55/0x80 [ip6_tables]
[<c055d9b2>] nf_sockopt+0x42/0x80
[<c055da44>] nf_setsockopt+0x24/0x30
[<c05ccb33>] ipv6_setsockopt+0xa3/0xb0
[<c05d35bb>] rawv6_setsockopt+0x3b/0x120
[<c0531182>] sock_common_setsockopt+0x22/0x30
[<c053047f>] sys_setsockopt+0x5f/0xb0
[<c0530b7b>] sys_socketcall+0x13b/0x2b0
[<c0606e4c>] sysenter_do_call+0x12/0x22
Code: 78 06 ac aa 84 c0 75 f7 31 c0 aa 89 d8 8b 74 24 04 8b 1c 24 8b 7c 24 08 c9 c3 55 89 e5 83 ec 08 89 34 24 89 7c 24 04 89 c6 89 d7 <ac> ae 75 08 84 c0 75 f8 31 c0 eb 04 19 c0 0c 01 8b 34 24 8b 7c
EIP: [<c0436901>] strcmp+0x11/0x30 SS:ESP 0068:f1365d2c
CR2: 0000000000000008
---[ end trace 1b53771b746cc619 ]---

It turns out the target list for xt[AF_INET6] is corrupted with exactly the same signature as seen previously (using physical addresses):

crash> struct xt_target 0xf8a94000
struct xt_target {
  list = {
    next = 0x0,
    prev = 0x0
  },
  name = "\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000",
  revision = 0 '\000',
  target = 0xf8a92540,
  checkentry = 0xf8a92000,
  destroy = 0,
  me = 0xf8a94100,
  table = 0xf8a9321b "filter",
  targetsize = 4,
  hooks = 14,
  proto = 0,
  family = 10
}
crash> vtop 0xf8a94000
VIRTUAL   PHYSICAL
f8a94000  7831f000

PAGE DIRECTORY: c0850000
  PGD: c0850f88 => 366d4067
  PMD: c0850f88 => 366d4067
  PTE: 366d4a50 => 7831f163
 PAGE: 7831f000

  PTE     PHYSICAL  FLAGS
7831f163  7831f000  (PRESENT|RW|ACCESSED|DIRTY|GLOBAL)

The first 32 bytes in the page are zeroed out:

000000007831f000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
000000007831f010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
000000007831f020 00 00 00 00 00 00 00 00 40 25 a9 f8 00 20 a9 f8
000000007831f030 00 00 00 00 00 41 a9 f8 1b 32 a9 f8 04 00 00 00
...
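The vtop output can be reproduced by hand. A Python sketch, assuming classic two-level 32-bit paging without PAE (4-byte entries, 10+10+12 bit split), which is consistent with the PGD and PMD lines being identical in the output. The flag table and helper names are written for this example; only the numbers come from the crash session:

```python
# Low flag bits of a 32-bit (non-PAE) x86 page-table entry for a 4 KiB page.
PTE_FLAGS = {
    0x001: "PRESENT", 0x002: "RW", 0x004: "USER", 0x008: "PWT",
    0x010: "PCD", 0x020: "ACCESSED", 0x040: "DIRTY", 0x080: "PAT",
    0x100: "GLOBAL",
}


def decode_pte(pte):
    """Return (physical page address, list of flag names) for a PTE value."""
    phys = pte & 0xFFFFF000
    flags = [name for bit, name in PTE_FLAGS.items() if pte & bit]
    return phys, flags


def pgd_entry_address(pgd_base, vaddr):
    """Address of the page-directory entry covering vaddr."""
    return pgd_base + ((vaddr >> 22) & 0x3FF) * 4


def pte_entry_address(pgd_entry, vaddr):
    """Address of the page-table entry covering vaddr."""
    return (pgd_entry & 0xFFFFF000) + ((vaddr >> 12) & 0x3FF) * 4


vaddr = 0xF8A94000
print(f"PGD slot: {pgd_entry_address(0xC0850000, vaddr):08x}")
print(f"PTE slot: {pte_entry_address(0x366D4067, vaddr):08x}")
phys, flags = decode_pte(0x7831F163)
print(f"page: {phys:08x} flags: {'|'.join(flags)}")
```

Since the structure starts at page offset 0, the corrupted bytes sit right at the start of physical page 7831f000, as in the earlier cases.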
ipv6_nlattr_to_tuple (code) is also corrupted the same way:

crash> rd -8 ipv6_nlattr_to_tuple 64
f8a8d000:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ................
f8a8d010:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ................
f8a8d020:  04 8b 49 04 89 0a 8b 4b 04 89 4a 04 8b 4b 08 89   ..I....K..J..K..
f8a8d030:  4a 08 8b 4b 0c 89 4a 0c 8d 4a 14 8b 58 10 8d 43   J..K..J..J..X..C

Compare this to nf_conntrack_ipv6.ko on disk:

Contents of section .text:
 0000 5589e553 bbeaffff ff8b480c 85c97508  U..S......H...u.
 0010 89d85bc9 c38d7600 83781000 74f28d59  ..[...v..x..t..Y
 0020 048b4904 890a8b4b 04894a04 8b4b0889  ..I....K..J..K..
 0030 4a088b4b 0c894a0c 8d4a148b 58108d43  J..K..J..J..X..C

And yes, it's again at the beginning of a page.
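The comparison can be done byte by byte. A short Python sketch using only the 64 bytes shown in this comment; it assumes the function sits at offset 0 of the module's .text section, as the two dumps suggest, and differing_offsets is a made-up helper name:

```python
# 64 bytes as read from memory with "rd -8" (first 32 bytes are zero).
memory = bytes(32) + bytes.fromhex(
    "048b4904890a8b4b04894a048b4b0889"
    "4a088b4b0c894a0c8d4a148b58108d43"
)

# The same 64 bytes from the .text section of nf_conntrack_ipv6.ko on disk.
disk = bytes.fromhex(
    "5589e553bbeaffffff8b480c85c97508"
    "89d85bc9c38d76008378100074f28d59"
    "048b4904890a8b4b04894a048b4b0889"
    "4a088b4b0c894a0c8d4a148b58108d43"
)


def differing_offsets(a, b):
    """Offsets at which two equally long byte strings differ."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


diffs = differing_offsets(memory, disk)
print(f"first diff at {diffs[0]:#x}, last diff at {diffs[-1]:#x}, "
      f"{len(diffs)} bytes differ")
```

Note that a plain byte diff undercounts slightly: where the on-disk code already contains a zero byte inside the damaged range, overwriting it with zero leaves no visible difference.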
I'm also affected, on an Acer Aspire 1825PTZ. I did a few tests, and it looks like rjw's patch does not even influence the frequency of the corruptions. If there is anything to test, please let me know.
Could this be caused by this bug, too?
https://bugzilla.kernel.org/show_bug.cgi?id=42691

-arne
BTW, still present in 3.3-rc7. Just FYI.
FYI: This does NOT happen with tuxonice (www.tuxonice.net).

This

a) might help to track down the issue (find out what tuxonice does differently...)
b) simply makes suspend to disk work again for people suffering from this bug
(In reply to comment #20)
> FYI: This does NOT happen with tuxonice (www.tuxonice.net).
>
> This
>
> a) might help to track down the issue (find, what tuxonice does different...)
> b) simply makes suspend to disk work again for people suffering from this bug

Can you specify which version of the kernel and tuxonice works?
I've been using it (again, after in-kernel suspend to disk got broken) since something like 2.6.35, and it has worked all the time (using current tuxonice).

(Note: that I've been using toi again since 2.6.35 does not mean the bug we are talking about was introduced in 2.6.35; it only means that this time I was annoyed enough to search for a workaround.)

At the moment I'm running 3.3-rc7 with tuxonice-head from git (branch tuxonice-head from git://github.com/NigelCunningham/tuxonice-kernel.git). Tuxonice at the moment is version 3.2.1.

Something like:

    git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
    cd linux
    git remote add tuxonice git://github.com/NigelCunningham/tuxonice-kernel.git
    git fetch --all
    git checkout -b toi
    git merge tuxonice/tuxonice-head

Of course, that tends to fail on early rcs of a new kernel until Nigel has updated his tree to reflect the changes in mainline.
(In reply to comment #22)
> I'm using it (again after in kernel suspend to disk got broken) since somewhat
> like 2.6.35 or so and it worked all the time (using tuxonice current).
>
> (Note: I'm using toi again since 2.6.35 does not mean the bug we are talking
> about was introduced in 2.6.35, it only means this time I was annoyed enough to
> search for an workaround).

Interesting. TOI saves more memory than regular in-kernel hibernate, including page cache. I wonder whether that has something to do with the lack of corruption. So, maybe i915 finds all its bits back where they are supposed to be with TOI and as a result doesn't corrupt somebody else's pages.

I know there are knobs in TOI to evict more stuff before hibernation. Would you mind trying those (i.e. save the minimal amount of RAM possible in the image)?
> I know there are knobs in TOI to evict more stuff before hibernation. Would you
> mind trying those (i.e. save minimal amount of RAM possible in the image)?

If there REALLY is a reason why you can't compile a tuxonice-enabled kernel and try it yourself, I'm willing to help, because I'm interested in seeing this bug fixed.

But then, please suggest a parameter (or parameters) to test, because when looking at Documentation/power/tuxonice.txt I do not see any parameter that looks to me like it influences the amount or kind of memory being stored or not stored.

Also, I doubt that it is that simple. I guess tuxonice basically does something different when resuming (if I remember correctly, somebody in the past verified that the corruption is not in the saved image, so it happens when resuming), and I doubt this is influenced by any parameter (just a gut feeling).

If you do want me to test anything, maybe we should arrange that via email (not with comments in bugzilla) and summarize the results later, to stop bothering people on CC who are probably not interested in those details.
(In reply to comment #24)
> If there REALLY is an reason, why you cant compile an tuxonice enabled kernel
> and try yourself I'm willing to help, because I'm interested in seeing this bug
> being fixed.

I can. It's just that you had it compiled, so I thought a quick test might show us more.

> But then, please suggest an parameter/parameters to test,

There is /sys/power/tuxonice/image_size_limit. If you set that to a very low value, TOI will attempt to evict pages to comply with the size. Or, you can even try with /sys/power/tuxonice/no_pageset2. This will limit the size of memory to 1/2, just like in-kernel swsusp.

> Also, I doubt that it is that simple. I guess, tuxonice basically does
> something different when resuming (if I remember correctly somebody in the past
> verified, that the corruption is not in the saved image, so it happens when
> resuming) and I doubt this is influenced by any parameter (just an gut
> feeling).

Yes, it is probably not that simple.
I already played around with /sys/power/tuxonice/no_pageset2 this morning, because I also read "does it like swsusp" and that sounded tempting.

To check whether the bug is there, I do something like this:

Once, create md5sums over various system files:

    find /bin /sbin /lib /lib64 /usr/lib /usr/lib64 -maxdepth 1 -type f -print0 | xargs -0 md5sum -b > md5

Then, after a suspend/resume cycle, check the md5sums:

    md5sum -c md5 | grep -v "OK$"

That shows errors with in-kernel swsusp quickly. If you know any better method to check, please let me know.

Normally, image_size_limit is -2, and one would expect that there is no disk activity when doing that md5sum -c after resume, because all those binaries were cached and the cache was saved/restored. But there is.

I've set image_size_limit to 1. One might expect that it won't hibernate, because the documentation for this parameter says "The maximum size of hibernation image written to disk, measured in megabytes": maximum size, not "try to not use more than...". But hibernation works (and in addition to what is shown with the default setting, userspace_ui says something like "trying to free xxxx MB of memory").

Now a finding that puzzles me a lot: when resuming from that (image_size_limit set to 1) and doing that md5sum -c, there is no disk activity, so the effect of image_size_limit on whether caching of files works across suspend/resume is exactly the opposite of what I would have expected.

But with none of the tests above (no_pageset2 and image_size_limit) did I ever see a corruption.
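The same check can be sketched in Python, in case someone wants to script it around hibernate cycles. This is only an illustration of the find/md5sum approach above, not a replacement for it; the function names (md5_of, build_manifest, changed_files) are made up for this example, and the directory list is taken from the find command:

```python
import hashlib
import os


def md5_of(path, chunk_size=1 << 20):
    """MD5 hex digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def build_manifest(directories):
    """Checksum regular files directly inside each directory (like -maxdepth 1 -type f)."""
    manifest = {}
    for directory in directories:
        try:
            entries = os.scandir(directory)
        except OSError:
            continue  # e.g. /lib64 does not exist on a 32-bit system
        with entries:
            for entry in entries:
                if entry.is_file(follow_symlinks=False):
                    try:
                        manifest[entry.path] = md5_of(entry.path)
                    except OSError:
                        pass  # unreadable file, skip it
    return manifest


def changed_files(manifest):
    """Paths whose current checksum differs from the recorded one (or that vanished)."""
    bad = []
    for path, recorded in manifest.items():
        try:
            if md5_of(path) != recorded:
                bad.append(path)
        except OSError:
            bad.append(path)
    return sorted(bad)


if __name__ == "__main__":
    before = build_manifest(["/bin", "/sbin", "/lib", "/lib64", "/usr/lib", "/usr/lib64"])
    input("Hibernate and resume now, then press Enter... ")
    for path in changed_files(before):
        print("MISMATCH", path)
```

As with md5sum -c, a mismatch after resume means the data returned by the page cache no longer matches what was read before hibernation; files that were never cached are simply re-read from disk and will look fine.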
Created attachment 72649
Possibly related panic with 3.3-rc7+ (a day or so before 3.3)

This is Fedora 16 with kernel 3.3-rc7+, compiled from git, on ThinkPad T510. After only a couple of hibernate/thaw cycles, I got this. Not sure right now whether i915 is somehow involved as well (usually it takes 20 to 25 cycles on my box to trigger the problem). Just for posterity.
And just to be sure I'm not going bananas, I just did 250 cycles with nomodeset of the recently released 3.3. I'm writing this in the FF of that resumed session. The only task that kept segfaulting was:

------------------
Mar 20 13:19:52 shrek kernel: [ 1932.492760] modem-manager[1838]: segfault at 44 ip 000000000042edba sp 00007fff6e387600 error 4 in modem-manager[400000+55000]
------------------

That is a bug in that software. Otherwise, the machine is fine. I did see some ext4 shenanigans along the way:

------------------
Mar 20 22:02:29 shrek kernel: [ 3.763720] EXT4-fs error (device dm-3): ext4_mb_generate_buddy:739: group 97, 9886 clusters in bitmap, 9883 in gd
------------------

Maybe this has to do with time: it is not 22:02:29 on Mar 20 here yet.

Interestingly, i915 is actually loaded as a module, but it obviously isn't setting the mode (normally, I'd get 1600x900, without it I get 1280x800).

So, yeah, still busted.
(In reply to comment #20)
> FYI: This does NOT happen with tuxonice (www.tuxonice.net).
>
> This
>
> a) might help to track down the issue (find, what tuxonice does different...)
> b) simply makes suspend to disk work again for people suffering from this bug

Just compiled current TOI head against the linux kernel head (3.3.0+), and with KMS I get "unexpected inconsistency" messages on resume (but I can resume) for my ext4 FS. Without KMS, I have not seen those, but I've seen this:

----------------------
Mar 21 17:33:11 shrek kernel: [ 95.521230] Pid: 2038, comm: pm-hibernate Not tainted 3.3.0+ #1
Mar 21 17:33:11 shrek kernel: [ 95.521324] Call Trace:
Mar 21 17:33:11 shrek kernel: [ 95.521415] [<ffffffff8109d46d>] free_update_stats+0x7d/0x90
Mar 21 17:33:11 shrek kernel: [ 95.521508] [<ffffffff8109d54a>] toi_free_page+0x5a/0x70
Mar 21 17:33:11 shrek kernel: [ 95.521604] [<ffffffff810acaf5>] forget_signature_page+0x45/0xe0
Mar 21 17:33:11 shrek kernel: [ 95.521699] [<ffffffff810aa063>] toi_bio_cleanup+0x33/0x70
Mar 21 17:33:11 shrek kernel: [ 95.521800] [<ffffffff8109cee7>] toi_cleanup_modules+0x47/0x70
Mar 21 17:33:11 shrek kernel: [ 95.521890] [<ffffffff8109f365>] toi_finish_anything+0x15/0x90
Mar 21 17:33:11 shrek kernel: [ 95.521980] [<ffffffff8109da8a>] toi_attr_store+0x8a/0x310
Mar 21 17:33:11 shrek kernel: [ 95.522080] [<ffffffff811741d3>] ? alloc_pages_current+0xa3/0x110
Mar 21 17:33:11 shrek kernel: [ 95.522177] [<ffffffff8120256f>] sysfs_write_file+0xef/0x170
Mar 21 17:33:11 shrek kernel: [ 95.522270] [<ffffffff81194aa3>] vfs_write+0xb3/0x180
Mar 21 17:33:11 shrek kernel: [ 95.522359] [<ffffffff81194dca>] sys_write+0x4a/0x90
Mar 21 17:33:11 shrek kernel: [ 95.522449] [<ffffffff81610ca9>] system_call_fastpath+0x16/0x1b
----------------------

There was some trouble with the FS on resume without KMS as well (something about stale nodes being removed). I have to test more.
Yeah, just did about 30 cycles with KMS and TOI. Box still running, ext4 errors notwithstanding (this may be a different issue). I'll try to find out what TOI does differently when it comes to KMS.
All I can think of (from the TOI patch) is that certain pages (that belong to GEM) are atomically copied with the image in TOI. Which I guess means that unlike regular hibernate, the pages that belong to GEM are always preserved in the image. I'll ask Dave Jones to see whether this makes sense to him.
On my box TOI crashes, too: http://bugzilla.tuxonice.net/show_bug.cgi?id=483
(In reply to comment #32)
> on my box TOI crashes, 2:
> http://bugzilla.tuxonice.net/show_bug.cgi?id=483

And when you specify nomodeset, does it work?
Created attachment 72739
patch to try and fix issue.

Suspend/resume fb device across hibernate.
(In reply to comment #34)
> Created an attachment (id=72739)
> patch to try and fix issue.
>
> Suspend/resume fb device across hibernate.

Just did 87 cycles of hibernate/thaw on my ThinkPad T510 with current head and this patch. This was never successful before. I'm writing this from that session. I will test more, but it sure looks promising.
Forgot to mention, the only segfaults are still these: ----------------- Mar 29 16:37:11 shrek kernel: [ 896.188486] modem-manager[23661]: segfault at 18 ip 0000000000435428 sp 00007fff28b4f390 error 4 ----------------- These are bugs in NetworkManager.
I did another 37 cycles now and the machine is still alive (total of 124 cycles, without a reboot). So, it seems that this indeed is the fix. I have never been able to do this many hibernate/thaw cycles without serious trouble on this machine, with i915/KMS. Writing this from that session. So, Dave Airlie, you are a genius! PS. NetworkManager did get seriously confused though and disconnected me from the LAN, but a restart fixed that. The only other thing I noticed was occasional whine by ext3/ext4 about inodes that have already been removed or some such.
Created attachment 72748
Get defect font after a few hibernate/resume cycles

With Dave's patch it gets VERY much better, but after a few cycles (~20 in this case) I get corrupted fonts; see the attached screenshot.
A patch referencing this bug report has been merged in Linux v3.4-rc1:

commit 3fa016a0b5c5237e9c387fc3249592b2cb5391c6
Author: Dave Airlie <airlied@redhat.com>
Date:   Wed Mar 28 10:48:49 2012 +0100

    drm/i915: suspend fbdev device around suspend/hibernate
Dave is awesome. Michael, please open a new bug (ideally at bugs.freedesktop.org) for your font issue; sounds like it's likely a separate problem.
This is definitely related to https://bugs.freedesktop.org/show_bug.cgi?id=28813

The commit 3fa016a0b5c5237e9c387fc3249592b2cb5391c6 improves the situation significantly, but does not resolve the problem completely (GM45, ThinkPad X301).
*** Bug 42643 has been marked as a duplicate of this bug. ***