Bug 15293 - Flash video laggy inside Firefox only with KMS
Summary: Flash video laggy inside Firefox only with KMS
Status: CLOSED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(DRI - non Intel)
Hardware: All Linux
Importance: P1 normal
Assignee: drivers_video-dri
URL:
Keywords:
Depends on:
Blocks: 14885
 
Reported: 2010-02-13 14:53 UTC by Michał Witkowski
Modified: 2010-03-21 20:05 UTC

See Also:
Kernel Version: 2.6.33-rc7
Subsystem:
Regression: No
Bisected commit-id:


Attachments
Sysprof trace of first 2 minutes of the video (gzipped) (34.79 KB, application/x-gzip)
2010-02-13 18:06 UTC, Michał Witkowski

Description Michał Witkowski 2010-02-13 14:53:12 UTC
System specs:
Arch Linux (20100212) x86_64
Lenovo U330 with Radeon 3450 and Intel 4500MHD
Kernel: 2.6.33-rc7 vanilla
KDE SC 4.4 (from KDEmod)
mesa: git 20100212
libdrm: git 20100212
xf86-video-ati: git 20100212
xf86-video-intel: git 20100212
flashplugin: 10.0.45.2 x86_64
Firefox 3.6
Running radeon KMS and intel KMS (switchable graphics).

I know it's hard to debug problems with closed-source software, but I guess there is a regression in the KMS code regarding video playback in flashplugin. It happens with many flash video players, but not all (it doesn't happen with YouTube).

For example:
http://vimeo.com/6194911
I've got laggy in-flash video buttons, the playback is skippy, and one core is maxed out. Firefox is also laggy: the RMB (right-click) context menu takes 2-3 seconds to open. When I switch to full screen mode, everything is fine and dandy: playback is smooth and the flash video control buttons work. This seems like an issue with video rendering inside other windows.

It doesn't happen for Intel KMS or the same radeon driver working in UMS. Happens regardless of compositing on/off.

I'm not the only one with this problem:
http://bbs.archlinux.org/viewtopic.php?pid=707955#p707955

I don't know if I posted this regression in the right bug tracker. I thought that since this issue arises only with KMS and not UMS, it's a kernel module problem.
Comment 1 Vi0L0 2010-02-13 15:51:21 UTC
Same bug with this flash video here.

System specs:
Arch Linux (20100213) x86_64
Desktop with Intel quad q6600 (2,4GHz) and Radeon hd 4850
Kernel: 2.6.33-rc7-git1 vanilla
KDE SC 4.4
mesa: git 20100213
libdrm: git 20100213
xf86-video-ati: git 20100213
flashplugin: 10.1_p2 32-bit (through nspluginwrapper)
Firefox 3.5.7
Running radeon KMS
Comment 2 Rafał Miłecki 2010-02-13 16:14:49 UTC
I saw that on Vimeo and thought it was a flash issue. I'll have to test that again then.

Do you use dynpm=1?
Comment 3 Vi0L0 2010-02-13 17:07:24 UTC
Tested also on:
Kernel: 2.6.33-rc8 vanilla
Browsers: Opera 10.10, Arora 0.10.2

dynpm=1 is breaking my kms...
Comment 4 Michał Witkowski 2010-02-13 17:16:37 UTC
Nope, I'm not using dynpm.

At first I also thought it was a flash issue. Later, however, I watched the same video using the Intel card and it didn't skip or lag. That made me investigate the problem.
Comment 5 Pauli 2010-02-13 17:27:05 UTC
> --- Comment #4 from Michał Witkowski <neuro@o2.pl>  2010-02-13 17:16:37 ---
> Nope, I'm not using dynpm.
>
> At first I also thought it was a flash issue. Later, however, I watched the
> same video using the Intel card and it didn't skip or lag. That made me
> investigate the problem.
>

Can you run sysprof to see which functions are eating the CPU time
while you watch flash?

http://www.daimi.au.dk/~sandmann/sysprof/ if your distro doesn't provide one :)

It is probably some slow software path that flash is hitting.
Comment 6 Michał Witkowski 2010-02-13 18:06:51 UTC
Created attachment 25033
Sysprof trace of first 2 minutes of the video (gzipped)

Trace environment:
Running KMS radeon, no compositing. KDE is running, Kwin, Firefox with a single tab containing the flash video. First 2 minutes.
Comment 7 Rafał Miłecki 2010-02-16 20:57:11 UTC
Michał: did radeon KMS in 2.6.32 work any better for you (I mean this flash issue ofc)?
Comment 8 Jérôme Glisse 2010-02-17 09:32:40 UTC
What is your Xorg server version? If it's older than the latest 1.7, then that is likely the issue. Also, I don't think it's a regression.
Comment 9 Rafał Miłecki 2010-02-17 09:42:45 UTC
Whoops, ooooldish 1.6.5 here. Thanks for pointing that out, I will try updating when I find a free day :)
Comment 10 Vi0L0 2010-02-17 12:45:00 UTC
xorg-server 1.7.4.901 here.
Also, sites with many flash elements (like some myspace profiles) are much more laggy with kms than with ums.
Comment 11 Pauli 2010-02-17 15:10:01 UTC
I did look at the provided profile data but it is mostly useless because debug symbols are missing. It would be nice if it could be repeated with debug symbols installed for the kernel, xserver and ddx driver.

I did some profiling on my AGP system to see what might be going on.

The problematic places for the vimeo video in the original report are:

1. 26% of CPU time goes to allocating a bo in GTT for exaGetImage (in DFS). The real problem in the allocation is the cache flush when changing pages from WB to WC, plus purge_vmap_area_lazy in ttm. But since there are also reports from PCIE users, I don't believe this is the problem for them.

2. libflashplayer.so taking 11% of CPU time directly and 12% indirectly by calling gtk/gdk.

3. 14% of CPU time is going to memcpy from GTT to system memory. This is far more than memcpy from system memory to GTT. I guess the WC caching is slowing down the operation, but I would still need to run some micro benchmarks to locate the problem.

4. UTS taking 7%.

That totals 70% CPU utilization. Firefox shows 26% CPU time in total, but that includes 23% for flash and only 3% for Firefox itself.

Is it possible to skip the blit to scratch in PCIe systems? Skipping the scratch would reduce memory bandwidth use quite nicely for large flash videos. 

Especially since flash is already wasting a lot of memory bandwidth. The data flow in flash video playback is system->VRAM->system->VRAM, which uses several times the memory bandwidth of simple video playback.

Idea for DFS handler optimization for AGP systems:
preallocate 2 scratch buffers in GTT (maybe 256k each?) for all DFS and UTS operations
function DFS:
    send 2 blit commands (from vram to scratch) to the GPU with a fence between them
    i = 0;
    while (data to copy) {
        map scratch[i]
        memcpy scratch[i] to system memory
        unmap scratch[i]
        if (more to read from vram) {
            send blit from vram to scratch[i]
        }
        i = 1 - i;
    }


There seem to be multiple performance bugs that flash is triggering, which cause the effects this bug report is about.
The largest bug seems to be the expensive buffer object allocation in GTT. I don't know if this can be fixed in the TTM code, but at least the ddx could reduce the number of allocations.
The next largest bug is that memcpy is very expensive when copying from GTT to system memory. I don't know why, or how to fix it, without some micro benchmarks.
Comment 12 Alex Deucher 2010-02-17 15:48:09 UTC
The radeon driver does vline range waits to prevent tearing when doing buffer swaps.  That is likely the case if flash uses GL.  You can remove the vline waits by removing:
info->accel_state->vsync = TRUE;
in radeon_dri2_copy_region() in radeon_dri2.c in xf86-video-ati.
Comment 13 Michel Dänzer 2010-02-17 15:59:54 UTC
(In reply to comment #11)
I'm not sure why any DownloadFromScreen operations would be necessary in the fast path for playing a video... Might be better to find and fix the cause of that.
Comment 14 Luca Tettamanti 2010-02-17 16:46:05 UTC
(In reply to comment #11)
> 1. 26% of CPU time goes to allocating a bo in GTT for exaGetImage (in DFS). The
> real problem in the allocation is the cache flush when changing pages from WB to
> WC, plus purge_vmap_area_lazy in ttm. But since there are also reports from PCIE
> users, I don't believe this is the problem for them.

With a PCIe card exaGetImage barely registers on oprofile. I also see very few calls to radeon_bo_create (if that's the function you're referring to).

> There seem to be multiple performance bugs that flash is triggering, which
> cause the effects this bug report is about.

It also seems specific to vimeo... with a normal youtube video the flash player is using up about 30% of the CPU and I see a bit of activity (5%) in pixman (pixman_blt_sse2).

With the video from the OP, memcpy (called by Xorg) is hogging the CPU (the same behaviour shows in the profile attached to this bug); however, I'm unable to get a call graph from oprofile... the relevant debugging symbols are installed, do you have any suggestions?
Comment 15 Pauli 2010-02-17 16:56:52 UTC
If anything that has been leaked about flash video playback is true
then flash is using DFS to get video frames for postprocessing (adding
UI elements).
We should prevent the first copy to vram but knowing when upload to
vram shouldn't be done might be quite hard.

> --- Comment #13 from Michel Dänzer <michel@daenzer.net>  2010-02-17 15:59:54
> ---
> (In reply to comment #11)
> I'm not sure why any DownloadFromScreen operations would be necessary in the
> fast path for playing a video... Might be better to find and fix the cause of
> that.
>
Comment 16 Luca Tettamanti 2010-02-17 17:11:15 UTC
(In reply to comment #15)
> If anything that has been leaked about flash video playback is true
> then flash is using DFS to get video frames for postprocessing (adding
> UI elements).

This could explain the difference between youtube (no elements on top of the video) and vimeo (stuff over the video)... is it known what flash uses to handle the video (I don't see libXv loaded)?

This is the relevant post:
http://blogs.adobe.com/penguin.swf/2010/01/solving_different_problems.html
Comment 17 Rafael J. Wysocki 2010-02-21 22:04:18 UTC
Not a regression, dropping from the list of recent regressions.
Comment 18 Rafael J. Wysocki 2010-02-21 22:06:23 UTC
On Sunday 21 February 2010, Rafał Miłecki wrote:
> 2010/2/21 Rafael J. Wysocki <rjw@sisk.pl>:
> > Bug-Entry       : http://bugzilla.kernel.org/show_bug.cgi?id=15293
> > Subject         : Flash video laggy inside Firefox only with KMS
> > Submitter       : Michał Witkowski <neuro@o2.pl>
> > Date            : 2010-02-13 14:53 (9 days old)
> 
> Not regression. Just KMS problem.
Comment 19 Rafał Miłecki 2010-02-21 22:32:42 UTC
(In reply to comment #12)
> The radeon driver does vline range waits to prevent tearing when doing buffer
> swaps.  That is likely the case if flash uses GL.  You can remove the vline
> waits by removing:
> info->accel_state->vsync = TRUE;
> in radeon_dri2_copy_region() in radeon_dri2.c in xf86-video-ati.

Does not help.
Comment 20 Vi0L0 2010-02-22 21:45:55 UTC
With midori browser flash seems to work much faster.
In fact i've got no lags on our vimeo flash video example when using midori...
So maybe it's not kms problem at all(?)
Comment 21 Michał Witkowski 2010-02-22 22:19:19 UTC
(In reply to comment #20)
> With midori browser flash seems to work much faster.
> In fact i've got no lags on our vimeo flash video example when using
> midori...
> So maybe it's not kms problem at all(?)


Today's (20100222) git of xf86-video-ati, libdrm, glproto and mesa.

Yes, I can confirm. In midori 0.2.3 there is no lag. Firefox, Chromium and Arora are still laggy, though. And the video is not laggy while using the latest xf86-video-intel on my 4500MHD.
Comment 22 Michał Witkowski 2010-02-23 18:41:35 UTC
Similar issues appear _without_ flash video in Firefox:

http://www.afiestas.org/fulfilling-promises-apple-wireless-keyboard-working-with-kbluetooth/

Everything appears really really laggy whenever you hover the video and controls appear. Could someone please confirm?
Comment 23 Vi0L0 2010-02-23 20:28:48 UTC
(In reply to comment #22)
> Similar issues appear _without_ flash video in Firefox:
> 
>
> http://www.afiestas.org/fulfilling-promises-apple-wireless-keyboard-working-with-kbluetooth/
> 
> Everything appears really really laggy whenever you hover the video and
> controls appear. Could someone please confirm?

I can confirm - really really laggy...

kernel 2.6.33-rc8-git5, yesterday's git of xf86-video-ati libdrm glproto and mesa
Comment 24 Michel Dänzer 2010-02-24 10:30:20 UTC
(In reply to comment #15)
> If anything that has been leaked about flash video playback is true
> then flash is using DFS to get video frames for postprocessing (adding
> UI elements).

Assuming by 'DFS' you mean XGetImage, IMO that would be NOTOURBUG as XGetImage just can't be relied on for good performance in general.

Also everybody please be careful not to turn this into one of those infamous monster bug reports by mixing in several possibly unrelated issues.
Comment 25 Pauli 2010-02-27 06:39:33 UTC
I vote for NOTOURBUG too. 

Maybe the ddx could be a bit better optimized for the XGetSubImage calls that flash is using, but in the end we can't get it to perform well enough anyway without modifying the blob.

So what exactly is happening to slow things down here:

-Flash is playing the video to vram.
-Flash calls XGetSubImage multiple times per frame to read back the whole video.
-XGetSubImage tries to copy data from VRAM to system memory but there seems to be huge latency for CPU to read the data that was just written by GPU.

Hack patch: http://www.youtube.com/watch?v=JLkGGNbBJeI ;)
Comment 26 Anonymous Emailer 2010-02-28 12:52:24 UTC
Reply-To: andyqos@ukfsn.org

Andy Furniss wrote:

> Out of curiosity, why does it work OK with UMS?

To possibly answer my own question for my case - AGP cards have accel 
UTS/DFS disabled in UMS.
Comment 27 Anonymous Emailer 2010-02-28 13:59:04 UTC
Reply-To: andyqos@ukfsn.org

bugzilla-daemon@bugzilla.kernel.org wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=15293
>
> --- Comment #25 from Pauli <suokkos@gmail.com> 2010-02-27 06:39:33 ---
> I vote for NOTOURBUG too.
>
> Maybe the ddx could be a bit better optimized for the XGetSubImage calls that
> flash is using, but in the end we can't get it to perform well enough anyway
> without modifying the blob.
>
> So what exactly is happening to slow things down here:
>
> -Flash is playing the video to vram.
> -Flash calls XGetSubImage multiple times per frame to read back the whole video.
> -XGetSubImage tries to copy data from VRAM to system memory but there seems to
> be huge latency for CPU to read the data that was just written by GPU.
>
> Hack patch: http://www.youtube.com/watch?v=JLkGGNbBJeI ;)
>

Out of curiosity, why does it work OK with UMS?

I can't see people being too pleased when their distros switch to KMS 
and they then can't play some youtube vids that they could previously.
Comment 28 Anonymous Emailer 2010-02-28 15:57:51 UTC
Reply-To: andyqos@ukfsn.org

Andy Furniss wrote:
> Andy Furniss wrote:
>
>> Out of curiosity, why does it work OK with UMS?
>
> To possibly answer my own question for my case - AGP cards have accel
> UTS/DFS disabled in UMS.

This does seem to be the case, if I enable uts/dfs in UMS (with 
BusTypePCIE) I run out of CPU while playing a youtube vid.

One possibly interesting thing I found is that disabling dfs when using 
KMS has no effect - but Option "EXANoUploadToScreen" cures the problem.

It still eats more CPU than UMS - but then so do xv and x11.
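
For anyone who wants to try the same workaround, the option goes into the Device section of xorg.conf. This is only a sketch: the Identifier value is a placeholder, and the rest of the section depends on the existing configuration.

```
Section "Device"
    Identifier "Radeon"                      # placeholder, keep your existing identifier
    Driver     "radeon"
    Option     "EXANoUploadToScreen" "true"  # disable accelerated UploadToScreen in EXA
EndSection
```

The X server has to be restarted for the change to take effect.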
Comment 29 Michał Witkowski 2010-02-28 16:24:11 UTC
(In reply to comment #25)
> I vote for NOTOURBUG too. 
> 
> Maybe the ddx could be a bit better optimized for the XGetSubImage calls that
> flash is using, but in the end we can't get it to perform well enough anyway
> without modifying the blob.
>
> So what exactly is happening to slow things down here:
>
> -Flash is playing the video to vram.
> -Flash calls XGetSubImage multiple times per frame to read back the whole video.
> -XGetSubImage tries to copy data from VRAM to system memory but there seems to
> be huge latency for CPU to read the data that was just written by GPU.
> 
> Hack patch: http://www.youtube.com/watch?v=JLkGGNbBJeI ;)

Right. And what about the problem with in-Firefox videos that don't use flash? I.e. comment #22? That also is miserably slow in KMS.

Also, if this bug gets NOTOURBUGged it would mean that flash video will be terribly slow for all radeon KMS users, right? 

http://xkcd.com/619/ comes to mind ;)
Comment 30 Rafał Miłecki 2010-02-28 19:01:21 UTC
(In reply to comment #28)
> Reply-To: andyqos@ukfsn.org
> 
> Andy Furniss wrote:
> > Andy Furniss wrote:
> >
> >> Out of curiosity, why does it work OK with UMS?
> >
> > To possibly answer my own question for my case - AGP cards have accel
> > UTS/DFS disabled in UMS.
> 
> This does seem to be the case, if I enable uts/dfs in UMS (with 
> BusTypePCIE) I run out of CPU while playing a youtube vid.

That's weird because we don't have problems with YouTube. It's about Vimeo.



(In reply to comment #26)
> Reply-To: andyqos@ukfsn.org
> 
> To possibly answer my own question for my case - AGP cards have accel 
> UTS/DFS disabled in UMS.

I use PCI-E card and I believe it has acceleration in both: UMS and KMS...
Comment 31 Pauli 2010-02-28 19:41:26 UTC
Why NOTOURBUG?

Because the Linux flash player is buggy. It is trying to do something that is never going to have good performance. Uploading data to VRAM and reading it back later is always going to be slow.

The flash player knows that it is going to download the video frames back soon anyway, so it shouldn't upload them. It is a lot faster to do everything in software than to try partial hardware acceleration.

I agree that the r600 driver is not doing everything perfectly for what flash wants to do. But the way flash tries to play the video is always going to take at least 3 times more hw resources than standalone players would take.
Comment 32 Anonymous Emailer 2010-02-28 19:50:01 UTC
Reply-To: andyqos@ukfsn.org

bugzilla-daemon@bugzilla.kernel.org wrote:

> --- Comment #30 from Rafał Miłecki<zajec5@gmail.com>   2010-02-28 19:01:21
> ---
>
> That's weird because we don't have problems with YouTube. It's about Vimeo.

Most youtube works for me, but some larger ones don't. I also see the 
effect of working/not working depending on uts with the kbluetooth vid 
above using seamonkey 2.02.

But I fear you may be correct and despite comment #24 I am seeing a 
separate issue and am derailing this bug, I will file a new one - I can 
now recreate my problem without flash.


> (In reply to comment #26)
>> Reply-To: andyqos@ukfsn.org
>>
>> To possibly answer my own question for my case - AGP cards have accel
>> UTS/DFS disabled in UMS.
>
> I use PCI-E card and I believe it has acceleration in both: UMS and KMS...

Yea, it would be interesting to know if mplayer -vo x11 CPU usage is 
much different between the two for you.
Comment 33 Rafał Miłecki 2010-03-15 17:59:36 UTC
I've updated kernel and xf86-video-ati and Vimeo works great for me!

Can this be related to http://cgit.freedesktop.org/xorg/driver/xf86-video-ati/commit/?id=488c9fd8300505cc6c0c2f8f0f00849f27cc5d63 ?
Comment 34 Alex Deucher 2010-03-15 18:01:40 UTC
(In reply to comment #33)
> I've updated kernel and xf86-video-ati and Vimeo works great for me!
> 
> Can this be related to
>
> http://cgit.freedesktop.org/xorg/driver/xf86-video-ati/commit/?id=488c9fd8300505cc6c0c2f8f0f00849f27cc5d63
> ?

Yes, fixing the domains increased GetImage performance by a factor of ~10 for me.
Comment 35 Alex Deucher 2010-03-15 18:03:22 UTC
This bug can probably be closed.
