Created attachment 157951 [details] kernel .config file

Long running stable systems exhibit the following with all mods removed, as distributed. (I use my own scheduler and file system - the kernels below use the stock ext4 file system):

Nov 18 11:03:36 Aesop kernel: [ 0.000000] Linux version 3.18.0-0-reaper (root@AESOP) (gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5) ) #5~rc5 SMP Tue Nov 18 09:54:47 PHT 2014
Nov 18 11:03:36 Aesop kernel: [ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-3.18.0-0-reaper root=UUID=8536097e-c02a-4a8c-9cdf-b40ba4e3b74d ro crashkernel=384M-2G:64M,2G-:128M drm_kms_helper.edid_firmware=edid/1280x1024_75.bin thermal.crt=75 thermal.nocrt=70 quiet splash vt.handoff=7

CPU: Intel(R) Pentium(R) D CPU 3.40GHz (fam: 0f, model: 06, stepping: 04)
Memory: 2G MB
DMI: ECS 945GCT-M2/945GCT-M2, BIOS 080012 07/18/2008
Intel graphics stolen memory is: 0x7f800000-0x7fffffff

V3.18~rc5 will reward you with the following upon boot:

Nov 18 11:05:34 Aesop kernel: [ 129.126366] ------------[ cut here ]------------
Nov 18 11:05:34 Aesop kernel: [ 129.126428] WARNING: CPU: 0 PID: 2158 at /home/jim/software/ubuntu/linux-3.18-rc5/drivers/gpu/drm/i915/intel_display.c:9914 intel_check_page_flip+0xb8/0xc1 [i915]()
Nov 18 11:05:34 Aesop kernel: [ 129.126431] Kicking stuck page flip: queued at 9583, now 9584
Nov 18 11:05:34 Aesop kernel: [ 129.126433] Modules linked in: ctr ccm nf_log_ipv4 nf_log_common xt_tcpudp ip6table_mangle iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat xt_TCPMSS xt_LOG ipt_REJECT iptable_mangle xt_multiport xt_state xt_limit xt_conntrack nf_conntrack_ftp nf_conntrack ip6table_filter ip6_tables iptable_filter ip_tables x_tables lp cdc_ether usbnet arc4 rt2800usb rt2800lib crc_ccitt rt2x00usb rt2x00lib ipv6 mac80211 cfg80211 gspca_zc3xx gspca_main videodev snd_hda_codec_idt snd_hda_codec_generic snd_hda_intel snd_hda_controller snd_hda_codec snd_hwdep snd_pcm_oss snd_mixer_oss snd_seq_dummy snd_pcm snd_seq_oss snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq snd_timer snd_seq_device snd soundcore ppdev gpio_ich serio_raw lpc_ich mfd_core parport_pc parport it87 hwmon_vid uas usb_storage 8139too 8139cp mii i915 drm_kms_helper
Nov 18 11:05:34 Aesop kernel: [ 129.126494] CPU: 0 PID: 2158 Comm: Xorg Not tainted 3.18.0-0-reaper #5~rc5
Nov 18 11:05:34 Aesop kernel: [ 129.126496] Hardware name: ECS 945GCT-M2/945GCT-M2, BIOS 080012 07/18/2008
Nov 18 11:05:34 Aesop kernel: [ 129.126499] 00000000000026ba ffff88007f403cc8 ffffffff816692ba 0000000000000007
Nov 18 11:05:34 Aesop kernel: [ 129.126503] ffff88007f403d18 ffff88007f403d08 ffffffff81045027 ffff88007f403d38
Nov 18 11:05:34 Aesop kernel: [ 129.126508] ffff880078db6008 ffff88007842bc30 0000000000000082 ffff88007842ba90
Nov 18 11:05:34 Aesop kernel: [ 129.126512] Call Trace:
Nov 18 11:05:34 Aesop kernel: [ 129.126514] <IRQ> [<ffffffff816692ba>] dump_stack+0x46/0x58
Nov 18 11:05:34 Aesop kernel: [ 129.126527] [<ffffffff81045027>] warn_slowpath_common+0x81/0x9f
Nov 18 11:05:34 Aesop kernel: [ 129.126531] [<ffffffff810450e8>] warn_slowpath_fmt+0x46/0x48
Nov 18 11:05:34 Aesop kernel: [ 129.126561] [<ffffffffa00862fa>] intel_check_page_flip+0xb8/0xc1 [i915]
Nov 18 11:05:34 Aesop kernel: [ 129.126587] [<ffffffffa0052f9a>] i915_handle_vblank+0x53/0xab [i915]
Nov 18 11:05:34 Aesop kernel: [ 129.126615] [<ffffffffa0059349>] i915_irq_handler+0x236/0x375 [i915]
Nov 18 11:05:34 Aesop kernel: [ 129.126620] [<ffffffff810850e1>] handle_irq_event_percpu+0x56/0x1b4
Nov 18 11:05:34 Aesop kernel: [ 129.126624] [<ffffffff81085279>] handle_irq_event+0x3a/0x61
Nov 18 11:05:34 Aesop kernel: [ 129.126627] [<ffffffff81087971>] handle_fasteoi_irq+0x7a/0xdc
Nov 18 11:05:34 Aesop kernel: [ 129.126631] [<ffffffff81004d26>] handle_irq+0x22/0x3c
Nov 18 11:05:34 Aesop kernel: [ 129.126636] [<ffffffff81675dc3>] do_IRQ+0x53/0xf0
Nov 18 11:05:34 Aesop kernel: [ 129.126640] [<ffffffff816741ea>] common_interrupt+0x6a/0x6a
Nov 18 11:05:34 Aesop kernel: [ 129.126642] <EOI> [<ffffffff81672dab>] ? _raw_spin_unlock_irqrestore+0xe/0x10
Nov 18 11:05:34 Aesop kernel: [ 129.126651] [<ffffffff8114be7d>] slob_alloc.isra.11+0x1df/0x217
Nov 18 11:05:34 Aesop kernel: [ 129.126655] [<ffffffff8114bfca>] slob_alloc_node+0x115/0x1a1
Nov 18 11:05:34 Aesop kernel: [ 129.126659] [<ffffffff8114c069>] kmem_cache_alloc+0x13/0x15
Nov 18 11:05:34 Aesop kernel: [ 129.126664] [<ffffffff815961c8>] __alloc_skb+0x43/0x23c
Nov 18 11:05:34 Aesop kernel: [ 129.126664] [<ffffffff8116beb0>] ? pollwake+0x64/0x6a
Nov 18 11:05:34 Aesop kernel: [ 129.126664] [<ffffffff81597079>] alloc_skb_with_frags+0x5d/0x1dd
Nov 18 11:05:34 Aesop kernel: [ 129.126664] [<ffffffff81592132>] sock_alloc_send_pskb+0xe7/0x193
Nov 18 11:05:34 Aesop kernel: [ 129.126664] [<ffffffff81635368>] unix_stream_sendmsg+0x303/0x40b
Nov 18 11:05:34 Aesop kernel: [ 129.126664] [<ffffffff8158da40>] do_sock_write.isra.20+0xc1/0xe4
Nov 18 11:05:34 Aesop kernel: [ 129.126664] [<ffffffff8158daab>] sock_aio_write+0x48/0x58
Nov 18 11:05:34 Aesop kernel: [ 129.126664] [<ffffffff8115a0f8>] do_sync_readv_writev+0x48/0x75
Nov 18 11:05:34 Aesop kernel: [ 129.126664] [<ffffffff8115b558>] do_readv_writev+0x1c5/0x2af
Nov 18 11:05:34 Aesop kernel: [ 129.126664] [<ffffffff81093ac0>] ? hrtimer_start+0x18/0x1a
Nov 18 11:05:34 Aesop kernel: [ 129.126664] [<ffffffff81094bff>] ? do_setitimer+0x277/0x2c4
Nov 18 11:05:34 Aesop kernel: [ 129.126664] [<ffffffff8115b67d>] vfs_writev+0x3b/0x3d
Nov 18 11:05:34 Aesop kernel: [ 129.126664] [<ffffffff8115b793>] SyS_writev+0x46/0x94
Nov 18 11:05:34 Aesop kernel: [ 129.126664] [<ffffffff81673596>] system_call_fastpath+0x16/0x1b
Nov 18 11:05:34 Aesop kernel: [ 129.126664] ---[ end trace fc7a22abba15ff45 ]---

V3.17 and earlier kernels run like champs. This regression is independent of Mesa (versions 9.2.5 - 10.3.2), xf86-video-intel (any up to and including 2.99.916), libdrm up to and including v2.4.58, as well as other boards utilizing the i915 driver.
I use the Gallium driver as I need OpenGL V2 support - I can't begin to express my gratitude for the Gallium driver. This error also shows up on an Asus board as well as two (2) other Elite motherboards - what a misnomer that name is. On 3.18-rc1, the page flip problem only happened if you ran an OpenGL consumer. Rc2-rc5 always log the problem at boot time with no OpenGL consumers. Then in rc4 and rc5 performance worsens over the course of about an hour until the machine hard locks up. The only recovery possible is to cycle the power. The hang leaves no trace whatsoever. Also, in rc5 only, after the Gnome desktop starts up, the screen flickers and rendering is distorted. A page table problem? (I also tried using i915.use_mmio_flip=1 as a parameter.) Enclosed is the kernel .config file common to all kernels using Intel embedded graphics. I donate my time in the Philippines teaching computer science to all ages, and as such I can only afford used hardware, or computers that are donated, in my classes; hence the importance of these old drivers. My newest machine (except my personal one) is a Core 2 processor (LGA775 socketed motherboards).

Best regards, Jim McDevitt jimmcdevitt60@yahoo.com.ph
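For anyone checking their own logs for the same symptom, here is a minimal sketch that picks the "Kicking stuck page flip" message out of syslog lines. It assumes only the message format quoted in the report above; the function name `find_stuck_flips` and the sample list are made up for illustration.

```python
import re

# Message format as it appears in the report:
#   "Kicking stuck page flip: queued at 9583, now 9584"
STUCK_FLIP = re.compile(r"Kicking stuck page flip: queued at (\d+), now (\d+)")


def find_stuck_flips(lines):
    """Return a (queued, now) pair for every stuck-page-flip warning found."""
    hits = []
    for line in lines:
        match = STUCK_FLIP.search(line)
        if match:
            hits.append((int(match.group(1)), int(match.group(2))))
    return hits


# Two lines copied from the trace in this report (illustrative input only).
sample = [
    "Nov 18 11:05:34 Aesop kernel: [ 129.126431] Kicking stuck page flip: queued at 9583, now 9584",
    "Nov 18 11:05:34 Aesop kernel: [ 129.126512] Call Trace:",
]

print(find_stuck_flips(sample))
```

In practice the input would be the lines of a saved dmesg or syslog file rather than the hard-coded sample.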
Can you please try to bisect where exactly this regression was introduced? Also please boot with drm.debug=0xe and then attach the complete dmesg so we know what hw exactly you have.
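As a side note on the requested parameter, the sketch below decodes what drm.debug=0xe turns on. It assumes the debug category bits as defined in include/drm/drmP.h for kernels of this era (core 0x01, driver 0x02, kms 0x04, prime 0x08); the helper name `decode_drm_debug` is hypothetical, and later kernels define additional bits.

```python
# Assumed bit layout from 3.18-era include/drm/drmP.h (DRM_UT_* values).
DRM_DEBUG_BITS = {
    0x01: "core",
    0x02: "driver",
    0x04: "kms",
    0x08: "prime",
}


def decode_drm_debug(mask):
    """List the debug categories enabled by a drm.debug bitmask."""
    return [name for bit, name in sorted(DRM_DEBUG_BITS.items()) if mask & bit]


print(decode_drm_debug(0xE))
```

Under that assumption, 0xe enables driver, KMS and PRIME messages while leaving out the noisier core messages.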
Created attachment 158081 [details] 88381.tar.7z

3.18-rc1 is where the problem started. As I get available time, I have been working to narrow it down to the specific commit, but there was a bit of activity on that release. I also included info on the motherboard. Bin it if you want. What else did you want? "Sorry,forgot, if you look" was cut off
Created attachment 158091 [details] dmesg and hw description

My email got lost in Globe. I'll post here just to be sure. 3.18-rc1 is where the problem started. As I get available time, I have been working to narrow it down to the specific commit, but there was a bit of activity on that release. I also included info on the motherboard. Bin it if you want. What else did you want? "Sorry,forgot, if you look" was cut off
Yeah I think we need the bisect, no clues from the logs.
Comment from Malcolm:
> I also have the same hang on playing videos or graphic operations.
>
> use_mmio_flip=1 makes no difference
>
> I have bisected it to
> d6bbafa183793537d8dca4d4c2e448805e59448a
> drm/i915: Check for a stalled page flip after each vblank
> and
> 9c787942907face82da505c2c5493998b56cfc5a
> drm/i915: Decouple the stuck pageflip on modeset
>
> Reverting these commits restores trouble free operation.
>
> The trouble starts in i915_handle_vblank, I am just reverting just the
> changes made to this function for now.

As Malcolm pointed out to me, I reversed those just in case I'm really missing something, all to no avail. I'll shoot for Sunday for the bisection.
Bisect is in early - I promised my son Sunday. I am now fairly befuddled. I did this process on three machines. On the twin of this motherboard, a student did the bisection manually. On the machine I reported, I wrote a script to try all possible combinations of what was pushed in rc1. When I came back to check, the result was inconclusive. So I looked a little more closely at the patch Malcolm suggested (by itself, it didn't solve my problem). Maybe my machines have different timings and maybe rc1 just exposed what was there all along. I then re-applied Malcolm's patch and re-ran my script. What it spit out was commit f0d3dad3. Lo and behold, no more problem. The patch description:

Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Sun Sep 7 16:51:12 2014 +0100

drm/i915: Wrap -EIO send-vblank event for failed pageflip in spinlock

drm_send_vblank_event() demands that we hold the event spinlock whilst calling it, so do so.

Looking at the code tells me this can't be the problem. I then extracted a fresh archive of rc1 and rc5, applied Malcolm's patch and reversed f0d3dad3. I rebuilt the entire kernel for both rc1 and rc5. No problems. If I do not revert f0d3dad3, I have the problem again. I installed the new rc1 kernel and also tried the rc5 kernel. Both were fine. I ran piglit and saw no problems except the tests that always fail. So now, all 3 machines are happy. All I want now is for somebody to explain to me WHY. Thank you Malcolm for pointing me in the right direction.
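The script itself was not posted, so the following is only a rough sketch of the idea of trying combinations of reverts, smallest sets first. Everything here is an assumption for illustration: the function `smallest_fixing_revert_set`, the candidate labels, and the `simulated_is_good` predicate, which stands in for the real (and slow) step of building and booting a kernel and simply encodes the outcome described in this comment.

```python
from itertools import combinations


def smallest_fixing_revert_set(candidates, is_good):
    """Try revert sets in increasing size; return the first set that tests good."""
    for size in range(len(candidates) + 1):
        for subset in combinations(candidates, size):
            if is_good(frozenset(subset)):
                return set(subset)
    return None  # no combination of reverts fixed the problem


# Hypothetical labels: "malcolm-patch" stands for Malcolm's revert of the
# i915_handle_vblank changes; "placeholder-unrelated" is a made-up filler.
CANDIDATES = ["malcolm-patch", "f0d3dad3", "placeholder-unrelated"]


def simulated_is_good(reverted):
    """Stand-in for build + boot + test: good only with both reverts applied."""
    return {"malcolm-patch", "f0d3dad3"} <= reverted


print(smallest_fixing_revert_set(CANDIDATES, simulated_is_good))
```

An exhaustive search like this grows exponentially with the number of candidates, which is why a plain git bisect is usually tried first; it only finds a single culprit, though, and can come back inconclusive when two changes interact as they appear to here.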
Just an update - 3.18-rc6, same story. The problem occurs with no patches installed, with only Malcolm's patch applied, and with only commit f0d3dad3 reversed. It is trouble free with Malcolm's patch applied and commit f0d3dad3 reversed. Looks like I should educate myself a bit more - some of the code I see is very hard to follow and sometimes I wonder what in the hell was that? I'm an old microcoder and OS hacker; I never really had to get too involved with the graphics side of it. Regards
Created attachment 160761 [details] [PATCH] drm/i915: Don't call intel_prepare_page_flip() multiple times on gen2-4

I ran into similar problems on my 830 when frobbing around with the vblank code. I believe this patch should help. Please test.
Fixed by

commit 7d47559ee84b3ac206aa9e675606fafcd7c0b500
Author: Ville Syrjälä <ville.syrjala@linux.intel.com>
Date: Wed Dec 17 23:08:03 2014 +0200

drm/i915: Don't call intel_prepare_page_flip() multiple times on gen2-4

in drm-intel-next-fixes. Thanks for the report.
*** Bug 91221 has been marked as a duplicate of this bug. ***