Why do Google's image processing Renderscript samples run slower on the GPU on the Nexus 5

I'd like to thank Stephen for the very quick reply to a previous post. This is a follow-up question to that post: Why very simple Renderscript runs 3 times slower in GPU than in CPU

My dev platform is as follows:

Development OS: Windows 7 32-bit
Phone: Nexus 5
Phone OS version: Android 4.4
SDK bundle: adt-bundle-windows-x86-20131030
Build-tool version: 19
SDK tool version: 22.3
Platform tool version: 19

In order to evaluate the performance of Renderscript GPU compute, and to learn general techniques for making code faster with Renderscript, I ran the following test.

I checked out the code from Google's Android Open Source Project, using the tag android-4.2.2_r1.2. I used this tag simply because the ImageProcessing test sample is not available in newer tags.

Then I used the project under "base\tests\RenderScriptTests\ImageProcessing" for the test. I recorded the running time of the code on the GPU as well as on the CPU; the results are listed below.

                             GPU       CPU
Levels Vec3 Relaxed          7.45ms    14.89ms
Levels Vec4 Relaxed          6.04ms    12.85ms
Levels Vec3 Full             N/A       28.97ms
Levels Vec4 Full             N/A       35.65ms
Blur radius 25               203.2ms   245.60ms
Greyscale                    7.16ms    11.54ms
Grain                        33.33ms   21.73ms
Fisheye Full                 N/A       51.55ms
Fisheye Relaxed              92.90ms   45.34ms
Fisheye Approx Full          N/A       51.65ms
Fisheye Approx Relaxed       93.09ms   39.11ms
Vignette Full                N/A       44.17ms
Vignette Relaxed             8.02ms    46.68ms
Vignette Approx Full         N/A       45.04ms
Vignette Approx Relaxed      8.20ms    43.69ms
Convolve 3x3                 37.66ms   16.81ms
Convolve 3x3 Intrinsics      N/A       4.57ms
ColorMatrix                  5.87ms    8.26ms
ColorMatrix Intrinsics       N/A       2.70ms
ColorMatrix Intrinsics Grey  N/A       2.52ms
Copy                         5.59ms    2.40ms
CrossProcess (using LUT)     N/A       5.74ms
Convolve 5x5                 84.25ms   46.59ms
Convolve 5x5 Intrinsics      N/A       9.69ms
Mandelbrot                   N/A       50.2ms
Blend Intrinsics             N/A       21.80ms

The N/A entries in the table occur because either full precision or RS intrinsics do not run on the GPU. Among the 13 algorithms that do run on the GPU, 6 run slower on the GPU than on the CPU. Since this code was written by Google, I consider this phenomenon worth investigating. At the very least, the assumption "the code will run faster on the GPU" that I took from Renderscript and the GPU does not hold here.

I investigated some of the algorithms on the list; I'd like to mention two.

In Vignette, the performance on the GPU is much better. I found that its implementation invokes several functions from rs_cl.rsh; if I comment out those functions, the CPU runs faster (see my previous question linked at the top for an extreme case). So the question is why this happens. Most of the functions in rs_cl.rsh are math related, e.g. exp, log, cos, etc. Why do such functions run a lot faster on the GPU? Is it because their implementations are inherently highly parallel, or just because the GPU implementations of those functions are better than the CPU ones?
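To make the pattern concrete, here is a minimal sketch of a math-heavy kernel in the same spirit. This is not the actual Vignette.rs from AOSP; the package name and the formula are made up for illustration. The commented-out line shows the native_* variant that trades precision for speed:

    #pragma version(1)
    #pragma rs java_package_name(com.example.rstest)  // hypothetical package
    #pragma rs_fp_relaxed

    // Per-pixel work is dominated by transcendental math from rs_cl.rsh.
    void root(const uchar4 *in, uchar4 *out) {
        float4 f = rsUnpackColor8888(*in);
        // Full-precision library calls:
        float gain = exp(-f.r) * cos(f.g * 3.14159f);
        // Faster alternative where the precision is acceptable:
        // float gain = native_exp(-f.r) * native_cos(f.g * 3.14159f);
        f.rgb = f.rgb * gain;
        *out = rsPackColorTo8888(f);
    }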

Another example is conv3x3 and conv5x5. Though there are cleverer implementations than Google's version in this test app, I think Google's implementation is certainly not bad: it tries to minimize the number of additions and uses helper functions from rs_cl.rsh such as convert_float4(). So at a glance, I assumed it would run faster on the GPU. However, it runs a lot slower (on both the Nexus 4 and the Nexus 5, which both use Qualcomm GPUs). I think this example is very representative, since the algorithm needs to access the pixels near the current pixel, an access pattern that is common to many image processing algorithms. If an implementation like 2D convolution can't be made faster on the GPU, I suspect many other algorithms will suffer the same fate. It would be highly appreciated if you could identify where the problem is and suggest some ways to make such algorithms faster.
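For reference, this is roughly the structure of the sample's 3x3 convolution, written from memory rather than copied verbatim from the AOSP file (the package name is made up, and I use the newer typed rsGetElementAt_uchar4 where the 4.2.2-era code uses untyped rsGetElementAt with a cast). Each output pixel performs nine gathers from the input allocation, which is exactly the neighbour-access pattern in question:

    #pragma version(1)
    #pragma rs java_package_name(com.example.rstest)  // hypothetical package
    #pragma rs_fp_relaxed

    rs_allocation gIn;   // input image, bound from Java
    int32_t gWidth;      // image width, set from Java
    int32_t gHeight;     // image height, set from Java
    float gCoeffs[9];    // 3x3 filter coefficients, set from Java

    void root(uchar4 *out, uint32_t x, uint32_t y) {
        // Clamp neighbour coordinates at the image border.
        uint32_t x1 = min((int32_t)x + 1, gWidth - 1);
        uint32_t x2 = max((int32_t)x - 1, 0);
        uint32_t y1 = min((int32_t)y + 1, gHeight - 1);
        uint32_t y2 = max((int32_t)y - 1, 0);

        // Nine gathers per output pixel; neighbouring cells reload
        // most of the same input values.
        float4 p = convert_float4(rsGetElementAt_uchar4(gIn, x2, y2)) * gCoeffs[0]
                 + convert_float4(rsGetElementAt_uchar4(gIn, x,  y2)) * gCoeffs[1]
                 + convert_float4(rsGetElementAt_uchar4(gIn, x1, y2)) * gCoeffs[2]
                 + convert_float4(rsGetElementAt_uchar4(gIn, x2, y))  * gCoeffs[3]
                 + convert_float4(rsGetElementAt_uchar4(gIn, x,  y))  * gCoeffs[4]
                 + convert_float4(rsGetElementAt_uchar4(gIn, x1, y))  * gCoeffs[5]
                 + convert_float4(rsGetElementAt_uchar4(gIn, x2, y1)) * gCoeffs[6]
                 + convert_float4(rsGetElementAt_uchar4(gIn, x,  y1)) * gCoeffs[7]
                 + convert_float4(rsGetElementAt_uchar4(gIn, x1, y1)) * gCoeffs[8];

        *out = convert_uchar4(clamp(p, 0.f, 255.f));
    }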

The more general question is: given the test results above, what guidelines should people follow to get higher performance and to avoid performance degradation as much as possible? After all, performance is the second most important goal of Renderscript, and I think the portability of RS is already quite good.

Thank you!

1 Answer

There are really two answers to this question.

1: Don't believe the hype regarding GPUs. For some workloads they are faster, but for many workloads the difference is small or negative. You have at least two different processor types available; don't worry about which one gets used, only worry about whether the performance is what you need.

2: For performance tuning, I would really focus on the algorithm and on avoiding slow operations. Examples:

  • Prefer float to double when float provides adequate precision.

  • Use RS_FP_RELAXED when you don't need IEEE-754 compliance.

  • Prefer multiplication to division.

  • Use native_* (e.g. native_powr) in place of the full-precision routines where the precision is adequate.

  • Use rsGetElementAt_* over rsSample or rsGetElementAt. The typed versions of get are faster than the general get and, in many cases, much faster than rsSample.

  • Loads from script globals are typically faster than loads from an rs_allocation; prefer globals for kernel constants. (A sketch combining several of these points follows this list.)
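As an illustration only (the kernel, package name, and globals here are hypothetical, not from the sample app), a small kernel that applies several of these tips at once: relaxed precision, float math, native_powr, and a per-pixel division replaced by multiplication with a reciprocal kept in a script global:

    #pragma version(1)
    #pragma rs java_package_name(com.example.rstest)  // hypothetical package
    #pragma rs_fp_relaxed  // no strict IEEE-754 needed here

    float gGamma;     // kernel constant in a script global, set from Java
    float gInvScale;  // 1/scale precomputed on the Java side

    void root(const uchar4 *in, uchar4 *out) {
        float4 f = rsUnpackColor8888(*in);   // float math, no double
        f.r = native_powr(f.r, gGamma);      // native_* instead of powr
        f.g = native_powr(f.g, gGamma);
        f.b = native_powr(f.b, gGamma);
        f.rgb = f.rgb * gInvScale;           // multiply by the precomputed reciprocal
        *out = rsPackColorTo8888(f);
    }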

3: There are some performance issues with global loads today on the Nexus (4,5,7v2) GPU path. These will be improved with updates.