Bug 91861

Summary: [Radeon RS780] Blank screen (no signal) on HDMI after boot in 3.15 & later
Product: Drivers
Reporter: Mike S. (michael.selig)
Component: Video (DRI - non Intel)
Assignee: drivers_video-dri
Status: RESOLVED CODE_FIX
Severity: high
CC: alexdeucher, deathsimple
Priority: P1
Hardware: Intel
OS: Linux
Kernel Version: 3.15 + later
Subsystem:
Regression: No
Bisected commit-id:
Attachments: dmesg (kernel 3.18.3) - RS780 blank screen

Description Mike S. 2015-01-23 07:29:47 UTC
Created attachment 164451
dmesg (kernel 3.18.3) - RS780 blank screen

I use this machine as a HTPC connected to our TV via HDMI. It has been running Ubuntu of various versions using the RADEON driver with no problems until Ubuntu 4.10 (kernel 3.16), when this problem first showed itself. I can successfully boot 3.13 (from Ubuntu 4.04), but when it was announced that VDPAU was supported on this old Radeon hardware from 3.18, I was hoping that by upgrading to 3.18, it would not only solve the "blank screen" problem, but also give me better video performance on XBMC/KODI. Unfortunately 3.18 exhibits the same "blank screen" problem. I then tried installing other kernel versions (including 3.19rc5), but nothing from 3.15 onwards works.

The RS780 hardware is integrated on the motherboard, and is fast enough for videos & TV etc, so I don't really want to change it for a noisy and/or heat generating video card.

My symptoms are that, after seeing the Grub screen, I see a kernel message or 2 (nothing important) flashing up for a fraction of a second, but then it goes blank, saying "no signal". I do not see the normal purple Ubuntu splash screen, which appears before Xorg starts. Interestingly, on occasions it will boot 3.18 successfully, but that is usually after a reboot from 3.13 or 3.14. Once it boots successfully it seems fairly stable, and I have managed to watch videos and TV on 3.18 a couple of times though I haven't tested it for long. However if it is booted again, chances are that it will fail. When the screen is blank, the system is still up - I can SSH to the box fine. And it seems that Xorg is up and running fine - Xorg.0.log looks fine - it's just that I can't see anything!

I have tried radeon.runpm=0 radeon.dpm=0, but they make no difference.

I think this problem may be related to https://bugzilla.kernel.org/show_bug.cgi?id=83461.

Any ideas?

Thanks,
Mike
Comment 1 Mike S. 2015-01-23 07:35:42 UTC
Sorry, the Ubuntu versions should be 14.04 & 14.10 of course.

Mike
Comment 2 Alex Deucher 2015-01-23 19:15:17 UTC
(In reply to Mike S. from comment #0)
> I think this problem may be related to
> https://bugzilla.kernel.org/show_bug.cgi?id=83461.
>

Probably the same issue.  Do any of the patches on that bug help?
Comment 3 Mike S. 2015-01-24 11:05:01 UTC
I have never built Linux before, but I'll give it a go. It might take me several days.

If I might make a suggestion, it would be nice if you could set those divider values as boot params.
Comment 4 Mike S. 2015-01-27 08:08:59 UTC
I haven't yet tested the patch to cap ref_div to 7, but here are the values reported with drm.debug=0xE, in case they are useful:

1. Kernel 3.14 (working):
14851, fb: 145.2 ref: 2, post: 7

2. Kernel 3.18 (not working):
148500 - 148500, fb: 1016.3 ref: 14, post: 7

I am no expert, but looking at the code, a ref_div value of 14 seems wrong. In r600d.h the REF_DIV_MASK is only 3 bits wide, so the max value it should ever be is 7, right?
Comment 5 Christian König 2015-01-27 09:15:17 UTC
(In reply to Mike S. from comment #4)
> I am no expert, but looking at the code, a ref_div value of 14 seems wrong.
> In r600d.h the REF_DIV_MASK is only 3 bits wide, so the max value it should
> ever be is 7, right?

The REF_DIV_MASK in r600d.h is for the system clock setup, not for the display clock.
Comment 6 Mike S. 2015-01-28 23:46:39 UTC
I can confirm that the patch https://bugzilla.kernel.org/attachment.cgi?id=163891 to cap ref_div to 7 fixes the problem on my RS780. I did the test on kernel 3.18.3. Note that I also cast the number 7 in the min() to unsigned to avoid a compiler warning.

I'd like to run it for a few days to ensure that it is stable.

I'd also like to ask a question, please. With this patch, the GPU divider values are now running at:

148500 - 148490, pll dividers - fb: 580.7 ref: 7, post 8

while on 3.14 they were:

14851, pll dividers - fb: 145.2 ref: 2, post: 7

These new values produce a marginally different dot clock rate (the old debug msg displayed the value 10x less than the new one), but the difference is tiny, and probably doesn't make any practical difference.
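
For reference, the three divider sets quoted in this report can be turned back into dot clocks with the usual PLL relation, output = reference * fb / (ref * post). The sketch below is only illustrative: the 14.32 MHz reference clock is an assumption (it is not stated anywhere in this report, but it reproduces the logged values), and `dot_clock_khz` is a hypothetical helper, not a driver function.

```python
# Assumed reference clock in kHz. Not stated in this report; 14.32 MHz is
# inferred because it reproduces the dot clocks shown in the debug output.
REF_KHZ = 14320


def dot_clock_khz(fb_div, ref_div, post_div, ref_khz=REF_KHZ):
    """Return the PLL output frequency in kHz for one set of dividers."""
    return ref_khz * fb_div / (ref_div * post_div)


# Divider sets quoted in this report: (label, fb, ref, post).
reported = [
    ("3.14 (working)", 145.2, 2, 7),
    ("3.18 (blank screen)", 1016.3, 14, 7),
    ("3.18.3 + ref_div cap", 580.7, 7, 8),
]

for label, fb, ref, post in reported:
    print(f"{label:22s} {dot_clock_khz(fb, ref, post):10.1f} kHz")
```

Under that assumption all three sets land within about 20 kHz of the 148.5 MHz target, with the two newer sets closer than the 3.14 one.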

Other than the dot clock, do the values of the 3 dividers affect anything else in the GPU? Are there any advantages or disadvantages running the FB and REF substantially higher (as they are now)?

Thank you!
Comment 7 Christian König 2015-01-29 13:00:14 UTC
(In reply to Mike S. from comment #6)
> I can confirm that the patch
> https://bugzilla.kernel.org/attachment.cgi?id=163891 to cap ref_div to 7
> fixes the problem on my RS780. I did the test on kernel 3.18.3. Note that I
> also cast the number 7 in the min() to unsigned to avoid a compiler warning.

Thanks for testing.

> I'd also like to ask a question, please. With this patch, the GPU divider
> values are now running at:
> 
> 148500 - 148490, pll dividers - fb: 580.7 ref: 7, post 8
> 
> while on 3.14 they were:
> 
> 14851, pll dividers - fb: 145.2 ref: 2, post: 7
> 
> These new values produce a marginally different dot clock rate (the old
> debug msg displayed the value 10x less than the new one), but the difference
> is tiny, and probably doesn't make any practical difference.

It does make quite a difference, that's why I've added the new code in the first place. Some people had problems with the old one resulting in unstable signals and audio/video desync after a while.

> Other than the dot clock, do the values of the 3 dividers affect anything
> else in the GPU? Are there any advantages or disadvantages running the FB
> and REF substantially higher (as they are now)?
When you use higher dividers in a PLL you can get closer to the desired frequency, but at the cost of higher jitter. The trick is to stay within the limits of the PLL. For example, the VCO shouldn't get too fast, otherwise the input voltage gets too high; the feedback divider shouldn't get too high, because otherwise the jitter is too high for the PLL to follow, etc.
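
The "closer to the desired frequency" part of this trade-off can be sketched with a brute-force search. This is not the driver's algorithm: `best_dividers` is a hypothetical helper, the 14.32 MHz reference clock is an assumption, the feedback divider is assumed to step in tenths (as the debug output suggests), and VCO limits and jitter are ignored entirely.

```python
REF_KHZ = 14320  # assumed reference clock in kHz, not taken from the driver


def best_dividers(target_khz, max_ref_div, post_div, ref_khz=REF_KHZ):
    """Brute-force the (fb_div, ref_div) pair closest to target_khz.

    fb_div is stepped in tenths. VCO and comparator limits are ignored.
    Returns (error_khz, fb_div, ref_div).
    """
    best = None
    for ref_div in range(1, max_ref_div + 1):
        # Feedback divider (in tenths) that would hit the target exactly.
        ideal_tenths = target_khz * ref_div * post_div * 10 / ref_khz
        fb_tenths = round(ideal_tenths)
        out_khz = ref_khz * fb_tenths / (10 * ref_div * post_div)
        candidate = (abs(out_khz - target_khz), fb_tenths / 10, ref_div)
        if best is None or candidate[0] < best[0]:
            best = candidate
    return best


# Compare how close we can get to 148.5 MHz as the ref_div cap is raised.
for cap in (2, 7, 14):
    err, fb, ref = best_dividers(148500, cap, post_div=7)
    print(f"ref_div <= {cap:2d}: fb={fb:7.1f} ref={ref:2d} error={err:6.2f} kHz")
```

Raising the cap can only keep or shrink the frequency error, because every smaller reference divider is still in the search space. What the sketch cannot show is the cost side (jitter and the electrical limits of the PLL), which is the point being made above.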

We most likely have an undocumented limit on the input frequency here, otherwise limiting the reference divider won't have such an effect.

Christian.
Comment 8 Mike S. 2015-01-29 22:20:15 UTC
Thanks for the explanation.

Can I suggest that you delay committing this patch for a week or so - I want to check that it is really stable. Yesterday I had one incident where the screen froze with diagonal coloured stripes for a couple of seconds, which I have never seen before. It recovered itself OK without having to re-boot. Could this have been caused by the higher divider values?
Comment 9 Alex Deucher 2015-01-30 01:21:08 UTC
(In reply to Mike S. from comment #8)
> Thanks for the explanation.
> 
> Can I suggest that you delay committing this patch for a week or so - I want
> to check that it is really stable. Yesterday I had one incident where the
> screen froze with diagonal coloured stripes for a couple of seconds, which I
> have never seen before. It recovered itself OK without having to re-boot.
> Could this have been caused by the higher divider values?

Probably not.  Usually PLL problems manifest themselves as no signal or periodic signal drop outs.  Keep in mind that prior to kernel 3.15, the driver used a different algorithm that had its own limitations.
Comment 10 Christian König 2015-01-30 15:43:55 UTC
(In reply to Alex Deucher from comment #9)
> (In reply to Mike S. from comment #8)
> > Thanks for the explanation.
> > 
> > Can I suggest that you delay committing this patch for a week or so - I
> > want to check that it is really stable. Yesterday I had one incident where
> > the screen froze with diagonal coloured stripes for a couple of seconds,
> > which I have never seen before. It recovered itself OK without having to
> > re-boot. Could this have been caused by the higher divider values?
> 
> Probably not.  Usually PLL problems manifest themselves as no signal or
> periodic signal drop outs.  Keep in mind that prior to kernel 3.15, the
> driver used a different algorithm that had its own limitations.

Yeah, agree. For PLLs you usually know in a few minutes if your setup works or not.

The only thing that could affect it a bit is the temperature, but then you are driving the PLL so close to the edge that you will notice problems relatively soon as well.
Comment 11 Mike S. 2015-01-30 22:44:49 UTC
Thanks for the info. It hasn't happened again so far.

> ... you are driving the PLL so close to the edge ...
Can you please explain what you mean? Are you saying that the higher fb_div & ref_div are pushing the PLL harder or faster?

Please bear in mind that I am a programmer, not an electrical engineer, and though I do understand the basic principles of video displays (dot clocks, refresh rates, blanking etc), to me these dividers are just numbers used to set the dot clock.
Comment 12 Christian König 2015-01-31 09:55:56 UTC
(In reply to Mike S. from comment #11)
> Thanks for the info. It hasn't happened again so far.
> 
> > ... you are driving the PLL so close to the edge ...
> Can you please explain what you mean? Are you saying that the higher fb_div
> & ref_div are pushing the PLL harder or faster?

If you want details you should probably read up on them on Wikipedia, but I will try to explain it in a few sentences.

The basic problem is how to generate a stable but still programmable frequency in electronics.

On the one hand you have crystals, which are usually very stable over a long period of time, but to change the frequency of a crystal you would need to change its size and/or the material it is made of. Clearly not something you can do with software.

On the other hand you have voltage-controlled oscillators (VCOs), which have the nice feature that you can push a certain voltage into them and get a certain frequency in return. The problem with those is that they are not temperature stable, e.g. you set up a certain voltage to get 100 MHz and after a while the electronics have warmed up and you suddenly get 101 MHz or 99 MHz or something like this.

The solution is a PLL: it compares a very stable input frequency to a frequency generated with a VCO and, based on the difference, adjusts the input voltage of the VCO. This way the VCO frequency is slowly adjusted to the stable input frequency and also stays stable over time when the electronics warm up.

Imagine that you now put a counter between the VCO output and the comparator input that counts to 3 before forwarding the VCO frequency to the comparator. This way, when you input a frequency of 100 MHz you get 300 MHz as the output frequency. This counter is easily adjustable with software and is called the feedback divider.

So in the end you've got 300 MHz. But what do we do when, for example, we want 150 MHz? For this the electronics have a so-called post divider. It's also software configurable: when, for example, you have a 300 MHz VCO frequency and set the divider to 2, you get 150 MHz at the PLL output.

The voltage given as input to the VCO needs to be within certain limits. The upper limit is easily understandable: imagine that you push 20 V into a 5 V circuit; all you usually get is blue smoke and useless electronics. The lower limit usually isn't so easily explainable, just keep in mind that the electronics need a certain voltage to start up. More rarely you also get voltage gaps where the VCO won't produce a stable output, etc.

So to stay within those limits you add a third divider which is put between the input frequency and the comparator. This divider is called the reference divider.
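
The three dividers described so far can be chained in a small sketch. It models an ideal, already locked PLL and nothing else; `pll_stages` is a hypothetical helper written for this explanation, not anything from the radeon driver.

```python
def pll_stages(f_in_mhz, ref_div, fb_div, post_div):
    """Return (comparator, vco, output) frequencies in MHz for a locked PLL."""
    # The reference divider scales the stable input before the comparator.
    comparator = f_in_mhz / ref_div
    # In lock, the VCO divided by the feedback divider equals the comparator
    # frequency, so the VCO itself runs fb_div times faster.
    vco = comparator * fb_div
    # The post divider scales the VCO down to the final output.
    output = vco / post_div
    return comparator, vco, output


# The example from the text: 100 MHz in, feedback 3, post 2.
print(pll_stages(100.0, ref_div=1, fb_div=3, post_div=2))
# Same dividers, but with the input halved by a reference divider of 2.
print(pll_stages(100.0, ref_div=2, fb_div=3, post_div=2))
```

The first call reproduces the 300 MHz VCO and 150 MHz output from the example above; the second shows how the reference divider lowers both the comparator and the VCO frequency when the other dividers are left unchanged.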

Driving the PLL at the edge now means that at some point the VCO is too close to one of its limits. E.g. when it starts up, the VCO usually doesn't match the input frequency at all, and because of this the voltage input jumps around quite a bit. So it sometimes works and sometimes doesn't. Or it initially works, but after the electronics have warmed up the VCO is suddenly out of range, etc.

Maybe that was a few more sentences than I expected; hopefully it was still understandable.
Comment 13 Mike S. 2015-02-01 22:19:02 UTC
Wow, thank you very much for the detailed explanation :-)

Though all I was really expecting was a "yes/no or not sure" answer to whether the new higher values of fb_div & ref_div on my RS780 could be responsible for pushing the PLL "closer to the edge" than with the driver on kernel 3.14, and therefore maybe less stable and/or produce more heat.

So, from your explanation, using a ref_div of 7 compared to 2, means that the VCO is running at a higher frequency, and hence, higher voltage. So I assume that would mean the GPU would be drawing more power & therefore run hotter, though I am not sure how significant this is to its overall power use.

Is that right?

The reason I would like to know this is to determine whether the new divider numbers would cause the GPU to be less stable & hotter, which is fairly important in a HTPC where I am trying to keep fan noise to a minimum. If so, I might then try to force the ref_div below 7 in the driver (as I know it has been stable for several years at 2).
Comment 14 Christian König 2015-02-06 09:44:11 UTC
(In reply to Mike S. from comment #13)
> So, from your explanation, using a ref_div of 7 compared to 2, means that
> the VCO is running at a higher frequency, and hence, higher voltage. So I
> assume that would mean the GPU would be drawing more power & therefore run
> hotter, though I am not sure how significant this is to its overall power
> use.
> 
> Is that right?

No, just the other way around. A higher reference divider results in a lower post divider and a lower feedback divider and that in turn results in a lower voltage and lower power consumption.

But the overall power consumption of a PLL compared to the whole system or even the GFX block is completely negligible. I would guess something between 1/1000000 and 1/100000 of a watt, compared to a couple of watts for the whole system.
Comment 15 Mike S. 2015-02-06 09:52:39 UTC
Thanks again for the info.

I have been running the Radeon driver with the patch to limit the reference divider to 7 for a week now, and I can confirm that apart from that original incident, it has been very stable.

So thank you for the patch, and I hope it can get into 3.19.
Comment 16 Christian König 2015-02-06 10:08:18 UTC
(In reply to Mike S. from comment #15)
> So thank you for the patch, and I hope it can get into 3.19.

Alex has added it to his -fixes branch, so it either ends up in 3.19 or in 3.20.

Can you close the bug report then?