Limits of workload that can be put into hardware accelerators


I am interested in understanding what percentage of workloads can almost never be put onto hardware accelerators. While more and more tasks are becoming amenable to domain-specific accelerators, I wonder whether there are tasks that will never benefit from an accelerator. Put simply, what are the tasks that are less likely to be accelerator-compatible?

I would love to have pointers to resources that speak to this question.

Best Answer

So you have the following question(s) in your original post:


Question:

  • I wonder whether there are tasks that will never benefit from an accelerator. Put simply, what are the tasks that are less likely to be accelerator-compatible?

Answer:

Of course it's possible. First and foremost, a workload that is to be accelerated on a hardware accelerator should not involve the following:

  • dynamic polymorphism and dynamic memory allocation
  • runtime type information (RTTI)
  • system calls
  • … (more, depending on the specific hardware accelerator)

Explaining every point above would make this post too lengthy, but I can explain a few. There is no support for dynamic memory allocation because a hardware accelerator has a fixed set of resources on silicon, and creating and freeing memory at run time cannot be mapped onto them. Similarly, dynamic polymorphism is only supported if the object behind the pointer can be determined at compile time. And there should be no system calls, because these ask the operating system to perform some task on the program's behalf; OS operations such as file reads/writes, or queries like the current time and date, are therefore not supported.
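To make this concrete, here is a minimal C++ sketch, assuming an HLS-style flow; the function names and sizes are made up for illustration. The first kernel uses constructs that generally cannot be synthesized, the second shows a synthesizable counterpart.

#include <cstdio>
#include <cstdlib>

// Hypothetical kernel showing constructs that HLS-style flows generally
// cannot map onto an accelerator. Names and sizes are illustrative only.
struct Shape {
    virtual float area() const = 0;
    virtual ~Shape() {}
};
struct Circle : Shape {
    float r;
    float area() const { return 3.14159f * r * r; }
};

float bad_kernel(int n) {
    // Dynamic memory allocation: the accelerator has a fixed pool of
    // on-chip resources, so heap growth at run time cannot be synthesized.
    float *buf = (float *)std::malloc(n * sizeof(float));

    // Dynamic polymorphism: the concrete type behind the pointer is only
    // known at run time, so the call target cannot be fixed at compile time.
    Shape *s = new Circle();
    float a = s->area();

    // System call / OS service: file and console I/O need an operating
    // system, which the accelerator fabric does not provide.
    std::printf("area = %f\n", a);

    delete s;
    std::free(buf);
    return a;
}

// A synthesizable counterpart: fixed-size storage, a statically known call
// graph, and no OS services.
#define MAX_N 1024
float good_kernel(const float in[MAX_N], int n) {
    float acc = 0.0f;
    for (int i = 0; i < MAX_N; ++i) {   // fixed trip count; n only gates the data
        if (i < n) acc += in[i] * in[i];
    }
    return acc;
}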

Having said that, the workloads that are less likely to be accelerator-compatible are mostly communication-intensive kernels. Such kernels often incur a serious data-transfer overhead compared with CPU execution, which can usually be detected by measuring the CPU-FPGA or CPU-GPU communication time.

For better understanding, let's take the following example:

Communication-Intensive Breadth-First Search (BFS):

procedure BFS(G, root) is
    let Q be a queue
    label root as explored
    Q.enqueue(root)
    while Q is not empty do
        v := Q.dequeue()
        if v is the goal then
            return v
        for all edges from v to w in G.adjacentEdges(v) do
            if w is not labeled as explored then
                label w as explored
                Q.enqueue(w)

The above pseudocode is the famous breadth-first search (BFS). Why is it not a good candidate for acceleration? Because it traverses all the nodes in a graph without doing any significant computation, so it is immensely communication-intensive rather than compute-intensive. Furthermore, for a data-driven algorithm like BFS, the shape and structure of the input dictate runtime characteristics such as locality and branch behaviour, which makes it a poor candidate for hardware acceleration.

  • Now the question arises: why have I focused on compute-intensive versus communication-intensive?

As you have tagged FPGA in your post, I can explain this concept with respect to FPGAs. For instance, in a system that uses a PCIe connection between the CPU and the FPGA, we calculate the PCIe transfer time as the elapsed time of data movement from host memory to device memory through PCIe-based direct memory access (DMA).

The PCIe transfer time is a significant factor in ruling out FPGA acceleration for communication-bound workloads. The BFS above can therefore show severe PCIe transfer overheads and is hence not acceleration-compatible.
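As a rough illustration, here is a back-of-envelope model of that decision. All numbers are invented assumptions, not measurements: offloading only pays off when the accelerated compute time plus the transfer time beats the CPU time.

#include <cstdio>

// Back-of-envelope offload model (all figures are illustrative assumptions):
// offloading pays off only if the accelerated compute time plus the PCIe
// transfer time beats the CPU time.
int main() {
    const double bytes_moved = 512.0 * 1024 * 1024;   // 512 MiB of graph data
    const double pcie_bw     = 12.0e9;                 // ~12 GB/s effective PCIe bandwidth
    const double cpu_time_s  = 0.30;                   // assumed CPU runtime of the kernel
    const double fpga_time_s = 0.25;                   // assumed kernel time on the FPGA

    const double transfer_s = bytes_moved / pcie_bw;           // one-way DMA time
    const double offload_s  = 2.0 * transfer_s + fpga_time_s;  // in + out + compute

    std::printf("transfer %.3f s, offload total %.3f s, CPU %.3f s\n",
                transfer_s, offload_s, cpu_time_s);
    std::printf("%s\n", offload_s < cpu_time_s ? "worth offloading"
                                               : "communication-bound: keep it on the CPU");
    return 0;
}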

On the other hand, consider the family of object-recognition algorithms implemented as deep neural networks. If you go through these algorithms you will find that a significant amount of time (often more than 90%) is spent in the convolution function. The input data is relatively small and the convolutions are embarrassingly parallel, and this makes them an ideal workload to move to a hardware accelerator.
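To see why, here is a simplified single-channel 2D convolution (the sizes are chosen only for illustration): every output pixel depends on a small input window and on nothing else, so the loop nest parallelizes and pipelines very well on an accelerator.

// Simplified single-channel 2D convolution (sizes are illustrative).
// Every output pixel is computed independently from a small input window,
// which is why the loop nest parallelizes and pipelines so well.
#define H 128   // input height
#define W 128   // input width
#define K 3     // kernel size

void conv2d(const float in[H][W], const float kernel[K][K],
            float out[H - K + 1][W - K + 1]) {
    for (int y = 0; y < H - K + 1; ++y) {       // each (y, x) output is independent:
        for (int x = 0; x < W - K + 1; ++x) {   // no loop-carried dependence
            float acc = 0.0f;
            for (int ky = 0; ky < K; ++ky)
                for (int kx = 0; kx < K; ++kx)
                    acc += in[y + ky][x + kx] * kernel[ky][kx];
            out[y][x] = acc;
        }
    }
}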

Let's take another example showing a perfect workload for hardware acceleration:

Compute-Intensive General Matrix Multiply (GEMM):

/* Illustrative sizes and element type; the original constants were not given. */
#define row_size   64
#define block_size 8
#define N          (row_size * row_size)
typedef float TYPE;

void gemm(TYPE m1[N], TYPE m2[N], TYPE prod[N]){
    int i, k, j, jj, kk;
    int i_row, k_row;
    TYPE temp_x, mul;

    /* Blocked loop nest: jj/kk select a block, the inner loops reuse it. */
    loopjj:for (jj = 0; jj < row_size; jj += block_size){
        loopkk:for (kk = 0; kk < row_size; kk += block_size){
            loopi:for (i = 0; i < row_size; ++i){
                loopk:for (k = 0; k < block_size; ++k){
                    i_row = i * row_size;
                    k_row = (k + kk) * row_size;
                    temp_x = m1[i_row + k + kk];      /* element of m1 reused across j */
                    loopj:for (j = 0; j < block_size; ++j){
                        mul = temp_x * m2[k_row + j + jj];
                        prod[i_row + j + jj] += mul;  /* accumulate into the product */
                    }
                }
            }
        }
    }
}

The code above is General Matrix Multiply (GEMM), a common algorithm in linear algebra, machine learning, statistics, and many other domains. The matrix multiplication here is computed with a blocked loop structure: reordering the arithmetic so that all of the elements in one block are reused before moving on to the next dramatically improves memory locality. Hence it is extremely compute-intensive and a perfect candidate for acceleration.
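A quick way to see how different this is from BFS is the compute-to-communication ratio: an n x n GEMM performs roughly 2*n^3 floating-point operations while moving only about 3*n^2 matrix elements, so the ratio grows with n. A small sketch of that arithmetic:

#include <cstdio>

// Rough compute-to-communication ratio for an n x n GEMM:
// ~2*n^3 floating-point operations against ~3*n^2 matrix elements moved,
// so the ratio grows linearly with n, the opposite of the BFS case above.
int main() {
    for (int n = 256; n <= 4096; n *= 2) {
        double flops = 2.0 * n * (double)n * n;              // multiply-adds
        double bytes = 3.0 * n * (double)n * sizeof(float);  // A, B and C moved once each
        std::printf("n = %4d : %.1f FLOPs per byte transferred\n", n, flops / bytes);
    }
    return 0;
}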

Hence, to name only a few, we can conclude that the following are the deciding factors for hardware acceleration:

  • the computational load of your workload
  • the data your workload accesses
  • how parallel your workload is
  • the underlying silicon available for acceleration
  • the bandwidth and latency of the communication channels

Do not forget Amdahl's Law:

Even if you have found the right workload, one that is an ideal candidate for hardware acceleration, the struggle does not end here. Why? Because the famous Amdahl's law comes into play. Meaning, you might be able to speed a kernel up significantly, but if it accounts for only 2% of the application's runtime, then even if you speed it up infinitely (take its runtime to zero) you will only speed up the overall application by about 2% at the system level. Hence, your ideal workload should not only be ideal algorithmically; it should also contribute significantly to the overall runtime of your system.
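In formula form, if a fraction p of the runtime is accelerated by a factor s, the overall speedup is 1 / ((1 - p) + p / s). A quick sanity check of the numbers above (the 90%/10x case is just an illustrative contrast):

#include <cstdio>

// Amdahl's law: overall speedup when a fraction p of the runtime
// is accelerated by a factor s.
static double amdahl(double p, double s) {
    return 1.0 / ((1.0 - p) + p / s);
}

int main() {
    // Accelerating 2% of the runtime, even "infinitely", caps the
    // whole-application speedup at about 1.02x.
    std::printf("p = 0.02, s = 10   -> %.4fx\n", amdahl(0.02, 10.0));
    std::printf("p = 0.02, s = 1e9  -> %.4fx\n", amdahl(0.02, 1e9));
    // Contrast: a kernel that dominates the runtime.
    std::printf("p = 0.90, s = 10   -> %.4fx\n", amdahl(0.90, 10.0));
    return 0;
}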