Notes on an Unreal talk from GDC 2014 about next-generation mobile graphics.
Approximate performance of the new hardware:
300+ GFLOPS
26 GB/s bandwidth
http://blog.imgtec.com/powervr/graphics-cores-trying-compare-apples-apples
Mobile GPU:
Tile-Based: ImgTec, Qualcomm...
Direct: NVIDIA, Intel, Qualcomm...
*Qualcomm Adreno can switch between tile-based and direct rendering.
- Extension: GL_QCOM_binning_control
Tile-Based Mobile GPU
- Tile size: 32x32 pixels (ImgTec), 300x300 pixels (Qualcomm)
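To get a feel for these tile sizes, here is a rough sketch that counts how many tiles cover one frame. The 1920x1080 resolution and the function name `tile_count` are assumptions for illustration; only the two tile sizes come from the notes.

```python
import math

def tile_count(width, height, tile_w, tile_h):
    """Number of tiles needed to cover a width x height framebuffer."""
    # Partial tiles at the right and bottom edges still count as whole tiles.
    return math.ceil(width / tile_w) * math.ceil(height / tile_h)

# Assumed 1920x1080 framebuffer (not from the talk).
print(tile_count(1920, 1080, 32, 32))    # 32x32 tiles (ImgTec)
print(tile_count(1920, 1080, 300, 300))  # 300x300 tiles (Qualcomm)
```

Under that assumed resolution the small tiles number in the thousands while the large ones number in the tens, which is why per-tile costs weigh differently on the two architectures.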
ImgTec G6430:
1 GPU core
4 USC (Unified Shading Cluster)
- 16 pipelines each
FP16: 2 SOP (Sum of Products) per clock
- (a*b + c*d)
FP32: 2 MADD per clock
- (a*b + c)
154 GFLOPS @ 400 MHz
- 16-bit floating point
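The 154 GFLOPS figure can be reproduced from the numbers above, assuming one SOP counts as 3 floating-point operations (two multiplies and one add) and one MADD as 2. That counting convention and the function name `peak_gflops` are assumptions; the notes do not state how the figure was derived, and the FP32 value below is only what the same arithmetic yields.

```python
def peak_gflops(clusters, pipelines, ops_per_clock, flops_per_op, clock_mhz):
    """Theoretical peak: lanes * ops per clock * FLOPs per op * clock in GHz."""
    return clusters * pipelines * ops_per_clock * flops_per_op * clock_mhz / 1000.0

# FP16: a*b + c*d is 2 multiplies + 1 add = 3 FLOPs per SOP (assumed convention).
fp16 = peak_gflops(4, 16, 2, 3, 400)
# FP32: a*b + c is 1 multiply + 1 add = 2 FLOPs per MADD (assumed convention).
fp32 = peak_gflops(4, 16, 2, 2, 400)
print(fp16, fp32)
```

With 4 USCs of 16 pipelines each at 400 MHz, the FP16 result rounds to the 154 GFLOPS quoted for 16-bit floating point.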
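Improvements:
-Support OpenGL ES 3.0
-Scalar, not Vector
-No additional cost for dependent texture reads
- Pixel shader math on texture coordinates
- Texture coordinates can be in .zw (swizzle)
- In other words, doing math on the UVs before reading a texture no longer has an extra cost, and neither does swizzling.
-Coherent dynamic flow control at full speed (not 1/4th speed)
-If every pixel takes the same branch (i.e. the branch is not resolved per pixel through a mask), it is no longer slowed down as it was before
-FP16 is the minimum precision (no more lowp)
-Using lowp gives no extra performance; the lowest precision is FP16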
Scalar vs Vector
GLSL: vec3 V = A*B + C*D;
Executed on the SGX 543 GPU at 75% speed; the w component is wasted:
vec4 V' = A*B;
vec4 V = C*D + V';
Executed on the G6430 GPU at full speed:
half V.x = A.x*B.x + C.x*D.x;
half V.y = A.y*B.y + C.y*D.y;
half V.z = A.z*B.z + C.z*D.z;
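The 75% figure follows from lane occupancy: a vec3 fills three of the four lanes of a vector unit, while a scalar unit issues exactly one operation per component. A minimal sketch of that arithmetic, where the 4-wide lane count for the vector case and the function name are assumptions:

```python
import math

def alu_utilization(components, lanes):
    """Useful work as a fraction of the issue slots consumed."""
    # A vector unit always consumes a whole multiple of its lane count.
    slots = math.ceil(components / lanes) * lanes
    return components / slots

print(alu_utilization(3, 4))  # vec3 on an assumed 4-wide vector ALU
print(alu_utilization(3, 1))  # vec3 as three scalar operations
```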
ImgTec tips
-Hidden Surface Removal (HSR)
- For opaque only
- Don't keep alpha-test enabled all the time
 - Don't keep "discard" keyword in shader source, even if it's not used (don't trust the compiler)
Both alpha-test and the use of discard defeat HSR.
-Group draws with the same state together.
-Sort on state, not distance
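-Qualcomm's rendering process
There are two vertex shaders. One of them runs before tiling and saves the face positions to RAM; tiling is done once all triangles are known. After all tiles have finished, the result is flushed to RAM.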
-Qualcomm Snapdragon Rendering Tips
-Traditional handling of overdraw (via depth test)
–Cull as much as you can on CPU, to avoid both CPU and GPU cost
–Sort on distance (front to back) to maximize early z-rejection
The Adreno SIMD is wide
-Performance problems come mainly from temp register usage rather than instruction count
-Hard to optimize in GLSL; check in a profiler
-Using the ALU can be faster than a lookup texture.
 - Avoid dependent texture fetches
Expensive to switch Frame Buffer Object on Tile-based GPUs
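-Avoid too many render target switches
-The driver saves the render target to RAM and then copies it back out, which is slow.
-If you need another render target, create a new one.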
Clear ALL FBO attachments after new frame/rendertarget
–Clear after eglSwapBuffers / glBindFramebuffer
–Avoids reloading FBO from RAM
–NOTE: Do NOT do unnecessary clears on non-tile-based GPUs (e.g. NVIDIA)
There is no direct command to tell the API not to copy. You can only give a hint, and the hint is clearing the buffer.
Discard unused attachments before new frame/rendertarget
–Discard before eglSwapBuffers / glBindFramebuffer
–Avoids saving unused FBO attachments to RAM
–glDiscardFramebufferEXT / glInvalidateFramebuffer
Programmable Blending
GL_EXT_shader_framebuffer_fetch (gl_LastFragData)
•Reads current pixel background “for free”
•Potential uses:
–Custom color blending
–Blend by background depth value (depth in alpha)
•E.g. Soft intersection against world geometry for particles
–Deferred shading without resolving GBuffer
•Stay on GPU and avoid expensive round-trip to RAM
The most basic use is soft particles. Reading the background depth is free; it is just a register.
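The soft-particle idea can be sketched as a fade factor driven by the gap between the particle and the background depth. In a real shader the background depth would come from the framebuffer fetch (for example depth stored in alpha, as noted above); here plain Python stands in for shader code, and the linear fade, the `fade_distance` parameter and the function name are all assumptions for illustration.

```python
def soft_particle_alpha(particle_alpha, scene_depth, particle_depth, fade_distance):
    """Fade a particle out as it approaches the geometry behind it."""
    # Gap between the particle fragment and the background surface.
    gap = scene_depth - particle_depth
    # Clamp to [0, 1]: 0 at the intersection, 1 once the gap reaches fade_distance.
    fade = max(0.0, min(1.0, gap / fade_distance))
    return particle_alpha * fade

print(soft_particle_alpha(1.0, 10.0, 9.75, 0.5))  # halfway through the fade range
```

Fading the alpha near the intersection removes the hard edge where a particle quad cuts into world geometry.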
Tips
•Think scalar! Avoid using unnecessary components
–Avoid: (vec4*float)*float (8 MUL)
–Use: vec4*(float*float) (5 MUL)
•Prefer 16-bit floating point operations (mediump)
•Leverage ALU to hide memory fetches
–E.g. ALU can be faster than using lookup-textures
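The 8 versus 5 MUL counts in the "think scalar" tip can be checked by counting one scalar multiply per output component. The `Counted` class and `count_muls` helper below are hypothetical and exist only for this illustration.

```python
class Counted:
    """Value with `width` components that counts scalar multiplies."""
    muls = 0  # shared counter across all instances

    def __init__(self, width):
        self.width = width

    def __mul__(self, other):
        # A component-wise multiply costs one scalar MUL per output component.
        out = Counted(max(self.width, other.width))
        Counted.muls += out.width
        return out

def count_muls(expr):
    """Evaluate expr() and return how many scalar multiplies it performed."""
    Counted.muls = 0
    expr()
    return Counted.muls

v, a, b = Counted(4), Counted(1), Counted(1)  # vec4, float, float
print(count_muls(lambda: (v * a) * b))  # (vec4*float)*float
print(count_muls(lambda: v * (a * b)))  # vec4*(float*float)
```

Multiplying the two floats first keeps the intermediate result scalar, so only the final multiply touches all four components.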
More Tips
•Don’t switch back and forth between mediump/highp
–ImgTec: Requires shader instructions to convert each time
–Qualcomm: Many conversions are free
Precision conversion has an extra cost. Do the high-precision work first, then convert to low precision.
•Branch spatially coherent for many pixels
–Uses predication to ignore invalid path
Core Optimization: Opaque Draw Ordering
•All platforms,
–1. Group draws by material (shader) to reduce state changes
•Then for all platforms except ImgTec,
–2. Skybox last: 5 ms/frame savings (vs drawing skybox first)
–3. Sort groups nearest first : extra 3 ms/frame savings
–4. Sort inside groups nearest first : extra 7 ms/frame savings
Sorting is a must! It has a very large impact on performance.
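The four ordering steps can be sketched as a CPU-side sort. The `Draw` record, its fields and the `is_imgtec` flag are assumptions for illustration, not the engine's actual data structures; a group's distance is taken to be that of its nearest draw.

```python
from dataclasses import dataclass

@dataclass
class Draw:
    name: str
    material: str
    distance: float
    is_skybox: bool = False

def order_opaque_draws(draws, is_imgtec=False):
    """Return draws in submission order following the steps listed above."""
    # Step 1 (all platforms): group draws by material.
    groups = {}
    for d in draws:
        groups.setdefault(d.material, []).append(d)
    if is_imgtec:
        # On ImgTec only the grouping step applies.
        return [d for g in groups.values() for d in g]
    # Step 4: nearest first inside each group.
    for g in groups.values():
        g.sort(key=lambda d: d.distance)
    # Steps 2 and 3: skybox groups last, other groups nearest first.
    ordered = sorted(
        groups.values(),
        key=lambda g: (any(d.is_skybox for d in g), g[0].distance),
    )
    return [d for g in ordered for d in g]

draws = [
    Draw("sky", "sky", 1000.0, is_skybox=True),
    Draw("rock_far", "rock", 50.0),
    Draw("tree", "leaf", 20.0),
    Draw("rock_near", "rock", 5.0),
]
print([d.name for d in order_opaque_draws(draws)])
```

Grouping by material comes first, so distance sorting only reorders whole groups and the draws within each group; it never splits a material group.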
Reference:
GDC2014 Next-gen Mobile Rendering