Bug 88381 - [945G] System hang, rendering errors.
Summary: [945G] System hang, rendering errors.
Status: RESOLVED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(DRI - Intel)
Hardware: x86-64 Linux
Importance: P1 high
Assignee: intel-gfx-bugs@lists.freedesktop.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-11-18 03:50 UTC by Jim
Modified: 2015-02-11 13:53 UTC
CC List: 4 users

See Also:
Kernel Version: 3.18-rc5 (rc1 - rc5)
Tree: Mainline
Regression: Yes


Attachments
kernel .config file (125.59 KB, application/octet-stream)
2014-11-18 03:50 UTC, Jim
88381.tar.7z (856.83 KB, application/x-7z-compressed)
2014-11-19 00:22 UTC, Jim
dmesg and hw description (856.83 KB, application/octet-stream)
2014-11-19 00:27 UTC, Jim
[PATCH] drm/i915: Don't call intel_prepare_page_flip() multiple times on gen2-4 (3.29 KB, patch)
2014-12-16 16:41 UTC, Ville Syrjala

Description Jim 2014-11-18 03:50:26 UTC
Created attachment 157951
kernel .config file

Long running stable systems exhibit the following with all mods removed, as distributed. (I use my own scheduler and file system - the kernels below use the stock ext4 file system):

Nov 18 11:03:36 Aesop kernel: [    0.000000] Linux version 3.18.0-0-reaper (root@AESOP) (gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5) ) #5~rc5 SMP Tue Nov 18 09:54:47 PHT 2014
Nov 18 11:03:36 Aesop kernel: [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-3.18.0-0-reaper root=UUID=8536097e-c02a-4a8c-9cdf-b40ba4e3b74d ro crashkernel=384M-2G:64M,2G-:128M drm_kms_helper.edid_firmware=edid/1280x1024_75.bin thermal.crt=75 thermal.nocrt=70 quiet splash vt.handoff=7

CPU: Intel(R) Pentium(R) D CPU 3.40GHz (fam: 0f, model: 06, stepping: 04)
Memory: 2G
MB DMI: ECS 945GCT-M2/945GCT-M2, BIOS 080012  07/18/2008
Intel graphics stolen memory is: 0x7f800000-0x7fffffff

V3.18~rc5 will reward you with the following upon boot:

Nov 18 11:05:34 Aesop kernel: [  129.126366] ------------[ cut here ]------------
Nov 18 11:05:34 Aesop kernel: [  129.126428] WARNING: CPU: 0 PID: 2158 at /home/jim/software/ubuntu/linux-3.18-rc5/drivers/gpu/drm/i915/intel_display.c:9914 intel_check_page_flip+0xb8/0xc1 [i915]()
Nov 18 11:05:34 Aesop kernel: [  129.126431] Kicking stuck page flip: queued at 9583, now 9584
Nov 18 11:05:34 Aesop kernel: [  129.126433] Modules linked in: ctr ccm nf_log_ipv4 nf_log_common xt_tcpudp ip6table_mangle iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat xt_TCPMSS xt_LOG ipt_REJECT iptable_mangle xt_multiport xt_state xt_limit xt_conntrack nf_conntrack_ftp nf_conntrack ip6table_filter ip6_tables iptable_filter ip_tables x_tables lp cdc_ether usbnet arc4 rt2800usb rt2800lib crc_ccitt rt2x00usb rt2x00lib ipv6 mac80211 cfg80211 gspca_zc3xx gspca_main videodev snd_hda_codec_idt snd_hda_codec_generic snd_hda_intel snd_hda_controller snd_hda_codec snd_hwdep snd_pcm_oss snd_mixer_oss snd_seq_dummy snd_pcm snd_seq_oss snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq snd_timer snd_seq_device snd soundcore ppdev gpio_ich serio_raw lpc_ich mfd_core parport_pc parport it87 hwmon_vid uas usb_storage 8139too 8139cp mii i915 drm_kms_helper
Nov 18 11:05:34 Aesop kernel: [  129.126494] CPU: 0 PID: 2158 Comm: Xorg Not tainted 3.18.0-0-reaper #5~rc5
Nov 18 11:05:34 Aesop kernel: [  129.126496] Hardware name: ECS 945GCT-M2/945GCT-M2, BIOS 080012  07/18/2008
Nov 18 11:05:34 Aesop kernel: [  129.126499]  00000000000026ba ffff88007f403cc8 ffffffff816692ba 0000000000000007
Nov 18 11:05:34 Aesop kernel: [  129.126503]  ffff88007f403d18 ffff88007f403d08 ffffffff81045027 ffff88007f403d38
Nov 18 11:05:34 Aesop kernel: [  129.126508]  ffff880078db6008 ffff88007842bc30 0000000000000082 ffff88007842ba90
Nov 18 11:05:34 Aesop kernel: [  129.126512] Call Trace:
Nov 18 11:05:34 Aesop kernel: [  129.126514]  <IRQ>  [<ffffffff816692ba>] dump_stack+0x46/0x58
Nov 18 11:05:34 Aesop kernel: [  129.126527]  [<ffffffff81045027>] warn_slowpath_common+0x81/0x9f
Nov 18 11:05:34 Aesop kernel: [  129.126531]  [<ffffffff810450e8>] warn_slowpath_fmt+0x46/0x48
Nov 18 11:05:34 Aesop kernel: [  129.126561]  [<ffffffffa00862fa>] intel_check_page_flip+0xb8/0xc1 [i915]
Nov 18 11:05:34 Aesop kernel: [  129.126587]  [<ffffffffa0052f9a>] i915_handle_vblank+0x53/0xab [i915]
Nov 18 11:05:34 Aesop kernel: [  129.126615]  [<ffffffffa0059349>] i915_irq_handler+0x236/0x375 [i915]
Nov 18 11:05:34 Aesop kernel: [  129.126620]  [<ffffffff810850e1>] handle_irq_event_percpu+0x56/0x1b4
Nov 18 11:05:34 Aesop kernel: [  129.126624]  [<ffffffff81085279>] handle_irq_event+0x3a/0x61
Nov 18 11:05:34 Aesop kernel: [  129.126627]  [<ffffffff81087971>] handle_fasteoi_irq+0x7a/0xdc
Nov 18 11:05:34 Aesop kernel: [  129.126631]  [<ffffffff81004d26>] handle_irq+0x22/0x3c
Nov 18 11:05:34 Aesop kernel: [  129.126636]  [<ffffffff81675dc3>] do_IRQ+0x53/0xf0
Nov 18 11:05:34 Aesop kernel: [  129.126640]  [<ffffffff816741ea>] common_interrupt+0x6a/0x6a
Nov 18 11:05:34 Aesop kernel: [  129.126642]  <EOI>  [<ffffffff81672dab>] ? _raw_spin_unlock_irqrestore+0xe/0x10
Nov 18 11:05:34 Aesop kernel: [  129.126651]  [<ffffffff8114be7d>] slob_alloc.isra.11+0x1df/0x217
Nov 18 11:05:34 Aesop kernel: [  129.126655]  [<ffffffff8114bfca>] slob_alloc_node+0x115/0x1a1
Nov 18 11:05:34 Aesop kernel: [  129.126659]  [<ffffffff8114c069>] kmem_cache_alloc+0x13/0x15
Nov 18 11:05:34 Aesop kernel: [  129.126664]  [<ffffffff815961c8>] __alloc_skb+0x43/0x23c
Nov 18 11:05:34 Aesop kernel: [  129.126664]  [<ffffffff8116beb0>] ? pollwake+0x64/0x6a
Nov 18 11:05:34 Aesop kernel: [  129.126664]  [<ffffffff81597079>] alloc_skb_with_frags+0x5d/0x1dd
Nov 18 11:05:34 Aesop kernel: [  129.126664]  [<ffffffff81592132>] sock_alloc_send_pskb+0xe7/0x193
Nov 18 11:05:34 Aesop kernel: [  129.126664]  [<ffffffff81635368>] unix_stream_sendmsg+0x303/0x40b
Nov 18 11:05:34 Aesop kernel: [  129.126664]  [<ffffffff8158da40>] do_sock_write.isra.20+0xc1/0xe4
Nov 18 11:05:34 Aesop kernel: [  129.126664]  [<ffffffff8158daab>] sock_aio_write+0x48/0x58
Nov 18 11:05:34 Aesop kernel: [  129.126664]  [<ffffffff8115a0f8>] do_sync_readv_writev+0x48/0x75
Nov 18 11:05:34 Aesop kernel: [  129.126664]  [<ffffffff8115b558>] do_readv_writev+0x1c5/0x2af
Nov 18 11:05:34 Aesop kernel: [  129.126664]  [<ffffffff81093ac0>] ? hrtimer_start+0x18/0x1a
Nov 18 11:05:34 Aesop kernel: [  129.126664]  [<ffffffff81094bff>] ? do_setitimer+0x277/0x2c4
Nov 18 11:05:34 Aesop kernel: [  129.126664]  [<ffffffff8115b67d>] vfs_writev+0x3b/0x3d
Nov 18 11:05:34 Aesop kernel: [  129.126664]  [<ffffffff8115b793>] SyS_writev+0x46/0x94
Nov 18 11:05:34 Aesop kernel: [  129.126664]  [<ffffffff81673596>] system_call_fastpath+0x16/0x1b
Nov 18 11:05:34 Aesop kernel: [  129.126664] ---[ end trace fc7a22abba15ff45 ]---
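
For anyone comparing logs across the rc kernels, the warning above can be pulled out of a saved dmesg mechanically. This is only a minimal sketch: the regular expression copies the message text exactly as it appears in the log above, and reading the two numbers as vblank counts is an assumption based on the call trace (i915_handle_vblank), not something the log itself states.

```python
import re

# Message text copied from the log above; nothing else is assumed about dmesg.
STUCK_FLIP = re.compile(r"Kicking stuck page flip: queued at (\d+), now (\d+)")

def stuck_flips(lines):
    """Yield (queued, now, delta) for every stuck-page-flip message."""
    for line in lines:
        m = STUCK_FLIP.search(line)
        if m:
            queued, now = int(m.group(1)), int(m.group(2))
            # delta is how far the counter moved between queueing and the kick
            # (assumed to be vblanks; see the note above).
            yield queued, now, now - queued

sample = [
    "Nov 18 11:05:34 Aesop kernel: [  129.126431] "
    "Kicking stuck page flip: queued at 9583, now 9584",
]
for queued, now, delta in stuck_flips(sample):
    print(queued, now, delta)  # prints: 9583 9584 1
```

In the trace above the difference is 1, so the flip was kicked on the very next count after it was queued.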


V3.17 and earlier kernels run like champs. This regression is independent of Mesa (versions 9.2.5 - 10.3.2), xf86-video-intel (any up to and including 2.99.916), libdrm up to and including v2.4.58, as well as other boards utilizing the i915 driver. I use the Gallium driver as I need OpenGL V2 support - I can't begin to express my gratefulness for the Gallium driver. This error also shows up on an Asus board as well as two (2) other Elite motherboards - what a misnomer that name is.

On 3.18-rc1, the page flip problem only happened if you ran an OpenGL consumer. Rc2-Rc5 always log the problem at boot time with no OpenGL consumers. In rc4 AND rc5, performance worsens over the course of about an hour and then the machine locks up hard. The only recovery possible is to cycle the power. The hang leaves no trace whatsoever. Also, in rc5 only, after the Gnome desktop starts, the screen flickers and rendering is distorted. A page table problem? (I also tried using i915.use_mmio_flip=1 as a parameter.)

Enclosed is the kernel .config file common to all kernels using Intel embedded graphics.

I donate my time in the Philippines teaching computer science to all ages, and as such I can only afford used hardware, or computers that are donated, in my classes. Hence the importance of these old drivers. My newest machine (except my personal one) is a Core 2 processor. (LGA775 socketed motherboards.)

Best regards,

Jim McDevitt
jimmcdevitt60@yahoo.com.ph
Comment 1 Daniel Vetter 2014-11-18 08:45:33 UTC
Can you please try to bisect where exactly this regression was introduced?

Also please boot with drm.debug=0xe and then attach the complete dmesg so we know what hw exactly you have.
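
For reference, drm.debug is a bitmask. A minimal sketch of what 0xe selects, assuming the DRM_UT_* bit values as I remember them from include/drm/drmP.h in kernels of this era (treat the names as an assumption and check your own tree):

```python
# Assumed bit assignments (DRM_UT_CORE, DRM_UT_DRIVER, DRM_UT_KMS, DRM_UT_PRIME).
DRM_DEBUG_BITS = {
    0x01: "CORE",
    0x02: "DRIVER",
    0x04: "KMS",
    0x08: "PRIME",
}

def decode_drm_debug(mask):
    """Return the names of the debug categories enabled by mask, lowest bit first."""
    return [name for bit, name in sorted(DRM_DEBUG_BITS.items()) if mask & bit]

print(decode_drm_debug(0xe))  # prints: ['DRIVER', 'KMS', 'PRIME']
```

Under that assumption, 0xe turns on driver, KMS and PRIME messages while leaving the noisier core messages off.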
Comment 2 Jim 2014-11-19 00:22:45 UTC
Created attachment 158081
88381.tar.7z

3.18-rc1 is where the problem started. As I get available time, I have
been working to narrow it down to the specific commit, but there was a
bit of activity on that release.

I also included info on the motherboard. Bin it if you want.

What else did you want? "Sorry,forgot, if you look" was cut off

On Tuesday, November 18, 2014, <bugzilla-daemon@bugzilla.kernel.org> wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=88381
>
> Daniel Vetter <daniel@ffwll.ch> changed:
>
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>              Status|NEW                         |NEEDINFO
>                  CC|                            |daniel@ffwll.ch
>
> --- Comment #1 from Daniel Vetter <daniel@ffwll.ch> ---
> Can you please try to bisect where exactly this regression was introduced?
>
> Also please boot with drm.debug=0xe and then attach the complete dmesg so we
> know what hw exactly you have.
>
> --
> You are receiving this mail because:
> You are on the CC list for the bug.
> You reported the bug.
>Sorry,forgot, if you look
Comment 3 Jim 2014-11-19 00:27:36 UTC
Created attachment 158091
dmesg and hw description

My email got lost in Globe. I'll post here just to be sure.

3.18-rc1 is where the problem started. As I get available time, I have
been working to narrow it down to the specific commit, but there was a
bit of activity on that release.

I also included info on the motherboard. Bin it if you want.

What else did you want? "Sorry,forgot, if you look" was cut off
Comment 4 Daniel Vetter 2014-11-19 09:33:27 UTC
Yeah I think we need the bisect, no clues from the logs.
Comment 5 Jim 2014-11-20 11:29:54 UTC
Comment from Malcolm:
>
> I also have the same hang on playing videos or graphic operations.
>
> use_mmio_flip=1 makes no difference
>
> I have bisected it to
> d6bbafa183793537d8dca4d4c2e448805e59448a
> drm/i915: Check for a stalled page flip after each vblank
> and
> 9c787942907face82da505c2c5493998b56cfc5a
> drm/i915: Decouple the stuck pageflip on modeset
>
> Reverting these commits restores trouble free operation.
>
> The trouble starts in i915_handle_vblank, I am just reverting just the
> changes
> made to this function for now.

As Malcolm pointed out to me, I reverted those commits, just in case I was really missing something, all to no avail. I'll shoot for Sunday for the bisection.
Comment 6 Jim 2014-11-22 04:11:45 UTC
Bisect in early - promised son Sunday.

I am now fairly befuddled. I did this process on three machines. On the twin of this motherboard, a student did the bisection manually. On the machine I reported, I wrote a script to try all possible combinations of what was pushed in rc1. When I came back to check, the result was inconclusive. So I looked a little more closely at the patch Malcolm suggested (by itself, it didn't solve my problem).

Maybe my machines have different timings and maybe rc1 just exposed what was there all along. I then re-applied Malcolm's patch and re-ran my script. What it spit out was commit f0d3dad3. Lo and behold, no more problem. The patch description:

   Author: Chris Wilson <chris@chris-wilson.co.uk>
   Date:   Sun Sep 7 16:51:12 2014 +0100

       drm/i915: Wrap -EIO send-vblank event for failed pageflip in spinlock
    
       drm_send_vblank_event() demands that we hold the event spinlock whilst
       calling it, so do so.
 
Looking at the code tells me this can't be the problem. I then extracted fresh archives of rc1 and rc5, applied Malcolm's patch and reverted f0d3dad3. I rebuilt the entire kernel for both rc1 and rc5. No problems. If I do not revert f0d3dad3, I have the problem again. I installed the new rc1 kernel and also tried the rc5 kernel. Both were fine. I ran piglit and saw no problems except the tests that always fail. So now, all 3 machines are happy.
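
One possible reading of the inconclusive first run is that no single change clears the problem, so a search over one change at a time finds nothing. The sketch below is not the script used above; it only illustrates the idea of testing combinations. The change labels and the is_fixed callback are hypothetical stand-ins; in real use the callback would apply the changes, rebuild, boot and test.

```python
from itertools import combinations

def find_fixing_sets(candidates, is_fixed, max_size=2):
    """Return the smallest combinations of changes for which is_fixed() is True.

    Sizes are tried in increasing order, so a single fixing change is
    reported before any pair is even attempted.
    """
    for size in range(1, max_size + 1):
        hits = [c for c in combinations(candidates, size) if is_fixed(set(c))]
        if hits:
            return hits
    return []

# Toy stand-in modelled on the result reported above: the problem goes away
# only when Malcolm's patch and the revert of f0d3dad3 are applied together.
NEEDED = {"malcolm-patch", "revert-f0d3dad3"}

def toy_is_fixed(applied):
    return NEEDED <= applied

# Hypothetical labels; "some-other-revert" is a placeholder with no real commit behind it.
candidates = ["malcolm-patch", "revert-f0d3dad3", "some-other-revert"]
print(find_fixing_sets(candidates, toy_is_fixed))
# prints: [('malcolm-patch', 'revert-f0d3dad3')]
```

With max_size=1 the same search returns an empty list, which matches what a one-change-at-a-time bisect would see in this toy model.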

All I want now is for somebody to explain to me WHY.

Thank you Malcolm for pointing me in the right direction.
Comment 7 Jim 2014-11-25 01:45:23 UTC
Just an update - 3.18-rc6 same story. Problem occurs with no patches installed, with only Malcolm's patch, and with only commit f0d3dad3 reverted.

Trouble free with Malcolm's patch and commit f0d3dad3 reverted.

Looks like I should educate myself a bit more - some of the code I see is very hard to follow and sometimes I wonder what in the hell was that?

I'm an old microcoder and OS hacker; I never really had to get too involved with the graphics side of it.

Regards
Comment 8 Ville Syrjala 2014-12-16 16:41:56 UTC
Created attachment 160761
[PATCH] drm/i915: Don't call intel_prepare_page_flip() multiple times on gen2-4

I ran into similar problems on my 830 when frobbing around with the vblank code. I believe this patch should help. Please test.
Comment 9 Jani Nikula 2014-12-18 10:11:23 UTC
Fixed by

commit 7d47559ee84b3ac206aa9e675606fafcd7c0b500
Author: Ville Syrjälä <ville.syrjala@linux.intel.com>
Date:   Wed Dec 17 23:08:03 2014 +0200

    drm/i915: Don't call intel_prepare_page_flip() multiple times on gen2-4

in drm-intel-next-fixes. Thanks for the report.
Comment 10 Jani Nikula 2015-02-11 13:53:44 UTC
*** Bug 91221 has been marked as a duplicate of this bug. ***
