Friday, June 27, 2014

Things to Watch for on Mobile in 2014

Notes on a GDC 2014 talk from Unreal (Epic Games) about next-generation mobile rendering.

Rough performance of the new hardware:
300+ GFLOPS
26 GB/s bandwidth

http://blog.imgtec.com/powervr/graphics-cores-trying-compare-apples-apples

Mobile GPU:
  Tile-Based: ImgTec, Qualcomm...
  Direct: Nvidia, Intel, Qualcomm..

*Qualcomm Adreno can switch between tile-based and direct rendering.
    - Extension: GL_QCOM_binning_control

Tile-Based Mobile GPU
- Tile size: 32x32 pixels (ImgTec), 300x300 (Qualcomm)

ImgTec G6430:
 1 GPU core
 4 USC (Unified Shading Cluster)
   -16 pipeline each
FP16: 2 SOP (Sum of Product) per clock
  - (a*b+c*d)
FP32: 2 MADD per clock
  -(a*b +c)
154 GFLOPS @ 400 MHz
  - 16 bit floating point
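
The 154 GFLOPS figure can be reproduced from the numbers listed above. This is a minimal sketch, assuming one SOP (a*b + c*d) counts as 3 floating-point operations (two multiplies and one add). That counting convention is my assumption, not something stated in the talk notes.

```python
# Peak FP16 throughput of the G6430, from the figures listed above.
USC_COUNT = 4            # Unified Shading Clusters
PIPELINES_PER_USC = 16
SOP_PER_CLOCK = 2        # FP16 sum-of-products per pipeline per clock
FLOPS_PER_SOP = 3        # assumption: a*b + c*d = 2 MUL + 1 ADD
CLOCK_HZ = 400e6         # 400 MHz


def peak_fp16_gflops():
    """Multiply out units x ops-per-clock x clock rate, in GFLOPS."""
    flops_per_clock = USC_COUNT * PIPELINES_PER_USC * SOP_PER_CLOCK * FLOPS_PER_SOP
    return flops_per_clock * CLOCK_HZ / 1e9
```

Under that assumption the result is about 153.6, which rounds to the quoted 154 GFLOPS.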

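Improvements:
 -Support OpenGL ES 3.0
 -Scalar, not Vector 
 -No additional cost for dependent texture reads
   - Pixel shader math on texture coordinates
   - Texture coordinates can be in .zw (swizzle)
   - In other words, doing math on the UVs before a texture read no longer has extra cost, and neither does swizzling.
 -Coherent dynamic flow control at full speed (not 1/4th speed)
   - If every pixel takes the same branch (i.e. the branch is not driven by a per-pixel mask), it is no longer slowed down as it was before.
 -FP16 is the minimum precision (no more lowp)
   - Using lowp gives no extra performance; the lowest precision is FP16.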


Scalar vs Vector
GLSL: vec3 V = A*B + C*D;
Executed on SGX 543 GPU: 75% speed, the w component is wasted
  vec4 V' = A*B;
  vec4 V = C*D + V';

Executed on G6430 GPU    full speed
 half V.x = A.x*B.x + C.x*D.x;
 half V.y = A.y*B.y + C.y*D.y;
 half V.z = A.z*B.z + C.z*D.z;
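
The 75% figure falls out of simple lane counting. The sketch below is a toy model of my own, not how either GPU actually schedules work: the vector GPU issues two 4-wide instructions but only x, y, z carry results, while the scalar GPU issues one SOP per component.

```python
def vector_gpu_utilization(components, lanes=4, instructions=2):
    """Useful lanes / issued lanes for a vecN op on a fixed-width vector unit."""
    issued = lanes * instructions        # e.g. vec4 MUL followed by vec4 MAD
    useful = components * instructions   # only the used components do real work
    return useful / issued


def scalar_gpu_utilization(components):
    """On a scalar unit every issued operation is useful."""
    issued = components                  # one scalar SOP per component
    useful = components
    return useful / issued
```

For a vec3, `vector_gpu_utilization(3)` gives 0.75 and `scalar_gpu_utilization(3)` gives 1.0, matching the "75% speed" and "full speed" notes.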

ImgTec tip
-Hidden Surface Removal (HSR)
  - For opaque only
  - Don't keep alpha-test enabled all the time
  - Don't keep the "discard" keyword in shader source, even if it's not used (don't trust the compiler)
    Both alpha-test and the use of discard break HSR

-Group draws that share the same state together.
-Sort on state, not distance

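-Qualcomm's rendering process
  The vertex shader runs twice. The first pass runs before tiling and saves the triangle positions to RAM; tiling is done once all triangles are known. After all tiles are finished, the result is flushed to RAM.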

-Qualcomm Snapdragon Rendering Tips
  -Traditional handling of overdraw (via depth test)
    –Cull as much as you can on CPU, to avoid both CPU and GPU cost
    –Sort on distance (front to back) to maximize early z-rejection
The Adreno SIMD is wide
   -Performance problems come mainly from temp register usage, not instruction count
   -Hard to optimize for in GLSL; check in the profiler
   -ALU math can be faster than a lookup texture
  -Avoid dependent texture fetches

Expensive to switch Frame Buffer Object on Tile-based GPUs
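  -Avoid switching render targets too often
  -The driver saves the render target to RAM and then copies it back, which is slow.
  -If you need another render target, create a new one.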

Clear ALL FBO attachments after new frame/rendertarget
–Clear after eglSwapBuffers / glBindFramebuffer
–Avoids reloading FBO from RAM
–NOTE: Do NOT do unnecessary clears on non-tile-based GPUs (e.g. NVIDIA)
  There is no direct command that tells the API not to copy. You can only hint at it, and the hint is clearing the buffer.

Discard unused attachments before new frame/rendertarget
–Discard before eglSwapBuffers / glBindFramebuffer
–Avoids saving unused FBO attachments to RAM
–glDiscardFramebufferEXT / glInvalidateFramebuffer
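
To get a feel for what the clear and discard hints save, here is a rough model of the RAM traffic for one render pass on a tile-based GPU. It is a simplification of my own: every attachment that is not cleared is assumed to be loaded once, and every attachment that is not discarded is assumed to be stored once. The 1920x1080 size and the RGBA8 + D24S8 formats are assumed values, not figures from the talk.

```python
def fbo_traffic_bytes(width, height, attachments, cleared=(), discarded=()):
    """Estimate bytes moved between tile memory and RAM for one render pass.

    attachments: dict of attachment name -> bytes per pixel.
    cleared:     names cleared right after binding (no load from RAM).
    discarded:   names invalidated before unbinding (no store to RAM).
    """
    pixels = width * height
    load = sum(bpp for name, bpp in attachments.items() if name not in cleared)
    store = sum(bpp for name, bpp in attachments.items() if name not in discarded)
    return pixels * (load + store)


# Assumed formats: 32-bit color and 32-bit depth/stencil.
ATTACHMENTS = {"color": 4, "depth_stencil": 4}

naive = fbo_traffic_bytes(1920, 1080, ATTACHMENTS)
tuned = fbo_traffic_bytes(
    1920, 1080, ATTACHMENTS,
    cleared=("color", "depth_stencil"),   # clear everything after binding
    discarded=("depth_stencil",),         # depth is not needed after the pass
)
```

In this model the untuned pass moves 33,177,600 bytes and the tuned pass 8,294,400 bytes, i.e. only the final color store remains.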

Programmable Blending
GL_EXT_shader_framebuffer_fetch (gl_LastFragData)
•Reads current pixel background “for free”
•Potential uses:
–Custom color blending
–Blend by background depth value (depth in alpha)
•E.g. Soft intersection against world geometry for particles
–Deferred shading without resolving GBuffer
•Stay on GPU and avoid expensive round-trip to RAM
The most basic use is soft particles. Reading the background depth is free; it is just a register.

Tips
•Think scalar! Avoid using unnecessary components
–Avoid: (vec4*float)*float (8 MUL)
–Use: vec4*(float*float) (5 MUL)
•Prefer 16-bit floating point operations (mediump)
•Leverage ALU to hide memory fetches
–E.g. ALU can be faster than using lookup-textures
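
The 8 MUL vs 5 MUL count in the first tip can be checked by counting multiplies explicitly. This is a small sketch with hypothetical helper names; a vec4 is modeled as a plain list of four floats.

```python
def scale_then_scale(v, a, b):
    """(vec4 * float) * float: two vector-by-scalar multiplies."""
    muls = 0
    tmp = [x * a for x in v]
    muls += len(v)               # 4 multiplies
    out = [x * b for x in tmp]
    muls += len(v)               # 4 more multiplies
    return out, muls


def combine_then_scale(v, a, b):
    """vec4 * (float * float): one scalar multiply, then one vector multiply."""
    s = a * b
    muls = 1                     # 1 scalar multiply
    out = [x * s for x in v]
    muls += len(v)               # 4 multiplies
    return out, muls
```

For the same inputs both functions produce the same vector, but the first reports 8 multiplies and the second 5.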

More Tips
•Don’t switch back and forth between mediump/highp
–ImgTec: Requires shader instructions to convert each time
–Qualcomm: Many conversions are free
   Precision conversion has extra cost. Do the high-precision work first, then convert to low precision.
•Branch spatially coherent for many pixels
–Uses predication to ignore invalid path

Core Optimization: Opaque Draw Ordering
•All platforms,
–1. Group draws by material (shader) to reduce state changes
•Then for all platforms except ImgTec,
–2. Skybox last: 5 ms/frame savings (vs drawing skybox first)
–3. Sort groups nearest first : extra 3 ms/frame savings
–4. Sort inside groups nearest first : extra 7 ms/frame savings
Sorting is a must! It has a very large impact on performance.
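
The ordering rules above can be sketched as a small sort. The dictionary fields (`material`, `distance`, `skybox`) are hypothetical names for illustration, and using the nearest member as a group's distance is my own assumption; the notes do not say how a group's distance is defined.

```python
from collections import defaultdict


def order_opaque_draws(draws, sort_by_distance=True):
    """Return draws ordered: grouped by material, nearest first, skybox last.

    draws: list of dicts with 'name', 'material', 'distance' and an
    optional 'skybox' flag. Pass sort_by_distance=False for ImgTec-style
    GPUs, where only the material grouping (rule 1) is applied.
    """
    skybox = [d for d in draws if d.get("skybox")]
    others = [d for d in draws if not d.get("skybox")]

    # Rule 1: group draws by material to reduce state changes.
    groups = defaultdict(list)
    for d in others:
        groups[d["material"]].append(d)
    group_list = list(groups.values())

    if sort_by_distance:
        # Rule 4: nearest first inside each group.
        for g in group_list:
            g.sort(key=lambda d: d["distance"])
        # Rule 3: nearest group first (group distance = its nearest draw).
        group_list.sort(key=lambda g: g[0]["distance"])

    # Rule 2: skybox last.
    return [d for g in group_list for d in g] + skybox
```

Keeping the material groups intact means the distance sort never reintroduces the state changes that rule 1 removed.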

Reference:
GDC2014 Next-gen Mobile Rendering

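Thursday, June 19, 2014

A Special Requirement under Deferred Rendering

This applies to Deferred Rendering in Unity 4. Unity 5 will change the approach substantially and has not been released yet, so I can't be sure how it behaves there.

Recently I had a special requirement: when applying a PostEffect, I wanted to exclude certain things. For example, the background is desaturated, but the characters should not be affected.

This is a problem under Deferred Rendering, because in Unity's Deferred Rendering a PostEffect can only act on the last Camera drawn. So I came up with a few approaches.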

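1. Write to the Stencil buffer while drawing the characters, and have the PostEffect test the Stencil to exclude them. The drawback is that the edges will be somewhat jagged, since Stencil probably has no AA. It also doesn't work for translucent objects.

2. Draw the characters with Forward after the PostEffect. The drawback is that the Camera depth is not shared, so the characters are not occluded by the scene.

3. Give the characters a Shader tag and use a Camera with RenderWithShader to output a mask. But as above, RenderWithShader only works under Forward, and the mask has no depth. Building a separate mask also adds draw calls.

4. Output the scene, effects, and characters to different RenderTextures, then combine them in another PostEffect. This is workable. It adds some Blit draw calls, but a fixed number of them. It just uses a lot of VRAM; a Full HD render target is fairly large.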
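
To put a number on "fairly large" in approach 4, here is a quick size estimate. It assumes a 32-bit color format (4 bytes per pixel) and no depth buffer unless one is requested; those formats are my assumptions, not something the project is known to use.

```python
def render_texture_bytes(width, height, bytes_per_pixel=4, depth_bytes_per_pixel=0):
    """Uncompressed size of one render target, in bytes."""
    return width * height * (bytes_per_pixel + depth_bytes_per_pixel)


def to_mib(num_bytes):
    """Convert bytes to mebibytes."""
    return num_bytes / (1024 * 1024)


one_target = render_texture_bytes(1920, 1080)   # one Full HD color target
three_targets = 3 * one_target                  # scene + effects + characters
```

Under those assumptions a single Full HD target is 8,294,400 bytes (roughly 7.9 MiB), and the three targets together are 24,883,200 bytes (roughly 23.7 MiB) before any depth buffers are added.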

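All of the approaches above seem to have drawbacks.

It would be nice if Cameras with different render paths could share a depth buffer.

So, what if I take the depth from the Deferred Rendering Camera and write it into the Forward Camera? Wouldn't that solve it?
That gives a modified version of approach 2.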

The Camera hierarchy is as follows:
MainCamera <-- Deferred Rendering
   -PlayerCamera  <-  Forward Rendering (clear flag -> don't clear)

Attach a DepthWrite script to PlayerCamera
and set its Culling Mask accordingly.

using UnityEngine;

// Attach to the forward-rendered camera (PlayerCamera). Before that camera renders,
// a full-screen quad writes the deferred camera's depth into the depth buffer.
[ExecuteInEditMode]
[RequireComponent(typeof(Camera))]
public class DepthWrite : MonoBehaviour {

 private Material depthWriteMaterial = null;

 // Use this for initialization
 void Start () {
  Shader shader = Shader.Find("Hidden/DepthWrite");
  if (shader != null) {
   depthWriteMaterial = new Material(shader);
  }
 }

 // Called by Unity right before this camera starts rendering.
 void OnPreRender()
 {
  // Skip drawing if the shader could not be found.
  if (depthWriteMaterial != null) {
   DrawQuad();
  }
 }

 void DrawQuad()
 {
  GL.PushMatrix();
  GL.LoadOrtho();

  depthWriteMaterial.SetPass(0);

  // Render the full-screen quad manually.
  GL.Begin(GL.QUADS);
  GL.TexCoord2(0.0f, 0.0f); GL.Vertex3(0.0f, 0.0f, 0.1f);
  GL.TexCoord2(1.0f, 0.0f); GL.Vertex3(1.0f, 0.0f, 0.1f);
  GL.TexCoord2(1.0f, 1.0f); GL.Vertex3(1.0f, 1.0f, 0.1f);
  GL.TexCoord2(0.0f, 1.0f); GL.Vertex3(0.0f, 1.0f, 0.1f);
  GL.End();

  GL.PopMatrix();
 }
}

The shader is as follows:
Shader "Hidden/DepthWrite" {
    Properties {
        _MainTex ("Base (RGB)", 2D) = "black" {}
    }
    SubShader {
        Pass {
            // Always pass the depth test and write depth, so the quad overwrites the depth buffer.
            ZTest Always Cull Off ZWrite On
            Fog { Mode off }

            CGPROGRAM
            #pragma exclude_renderers gles flash
            #pragma vertex vert
            #pragma fragment frag
            #pragma target 4.0
            #include "UnityCG.cginc"

            // vertex input: position, UV
            struct appdata {
                float4 vertex : POSITION;
                float2 texcoord : TEXCOORD0;
            };

            struct v2f {
                float4 pos : SV_POSITION;
                float2 uv : TEXCOORD0;
            };

            v2f vert (appdata v) {
                v2f o;
                o.pos = mul(UNITY_MATRIX_MVP, v.vertex);
                o.uv = v.texcoord.xy;
                return o;
            }

            sampler2D _MainTex;
            sampler2D _CameraDepthTexture;

            // Only depth is written; a color output is not needed.
            struct fragOut {
                float depth : DEPTH;
            };

            fragOut frag (v2f i) {
                fragOut o;
                // Copy the depth from _CameraDepthTexture into this camera's depth buffer.
                o.depth = UNITY_SAMPLE_DEPTH(tex2D(_CameraDepthTexture, i.uv));
                return o;
            }
            ENDCG
        }
    }
}
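I originally thought I would have to copy _CameraDepthTexture myself, but it turned out that what I got was already correct; the values had not been cleared. If the Forward Camera's Clear Flags clear depth, the values are wiped. So if you want to draw your own mask as in approach 3 above, you should also be able to pass it in and test it yourself.

The drawback of this method is that it uses Shader Model 4.0, so the DX11 flag has to be turned on.
Also, Soft Particles drawn by PlayerCamera stop working (even though _CameraDepthTexture is available). Commenting out #ifdef SOFTPARTICLES_ON in the particle shader fixes it. Presumably Unity turns the define off when it detects that the Camera uses a different render path.

To make this work under D3D9, another idea is to read the depth map in the vertex shader: generate a plane in advance with as many vertices as there are screen pixels, then write the depth. Aligning the vertices with the pixels may be a bit fiddly, but in theory it is feasible.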
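
The fiddly part of the D3D9 idea is placing one vertex on each pixel. A minimal sketch of that placement, assuming each vertex should sit at its pixel's center in normalized 0..1 screen coordinates (the function name is hypothetical, and D3D9's half-pixel offset convention may need a further adjustment that is not handled here):

```python
def pixel_center_uvs(width, height):
    """Return one (u, v) pair per screen pixel, row by row, at pixel centers."""
    return [
        ((x + 0.5) / width, (y + 0.5) / height)
        for y in range(height)
        for x in range(width)
    ]


# Small example grid; a real screen would need width * height vertices,
# e.g. 1920 * 1080 = 2,073,600 for Full HD.
grid = pixel_center_uvs(4, 2)
```

The same pairs could serve both as the vertex positions of the plane and as the UVs used to sample the depth map.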

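Wednesday, June 18, 2014

A Bunch of Forward-Only Features

Unity's documentation is terrible; it is not even clear about the many things that are Forward-only.

I'm compiling a list here and will add to it later.

1. Shader tags, so Camera.RenderWithShader doesn't work either.
2. Camera.targetTexture: if you want to grab the render target, just use Blit.
3. Stencil Buffer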