Compile-time (constexpr) compute graph generation in C++


I have recently started prototyping my HPC/ML library; it is mostly for research and learning. I was hoping to get some advice on design problems. I am only aiming to support the latest releases of the GCC, Clang, and Intel compilers, so recommendations that rely on any available C++ feature would be appreciated.

The library works in the steps listed below the code. Here is the current prototype API:

#include <array>

#include "manifold/array.hpp"
#include "manifold/constants.hpp"
#include "manifold/ops/element_wise_ops.hpp"
#include "manifold/static_graph.hpp"
#include "manifold/tensor.hpp"
#include "manifold/utility.hpp"
#include "scions/ep/cpu/cpu_mem_store.hpp"
#include "scions/ep/cpu/exec_graph_gen.hpp"

consteval auto buildGraph() noexcept {
  using namespace manifold;
  // Example usage: the constructor arguments 1, 2, 3, etc. are ids.
  // Also, every Tensor is an Expression.
  constexpr auto arr1 = Array<DType::F32, 200>(1);
  constexpr auto arr2 = Array<DType::F32, 200>(2);
  constexpr auto arr3 = Array<DType::F32, 200>(3);

  constexpr auto ten     = Tensor<TBase<DType::F32, 10, 20>>(4);
  constexpr auto ten_arr = ten.toArray();

  // ArrayAdd just returns a runtime Expression class that holds input and output info
  // (using pointers) and other relevant data.
  auto add_1_2         = op::array_add(ten_arr, std::array{ arr1, arr2 });
  auto mul_1_2         = op::array_mul(arr3, std::array{ arr1, arr2, ten_arr });
  const std::array exp = { &add_1_2, &mul_1_2 };
  // Static Graph is a flattened graph into array of Expressions
  return StaticGraph(exp);
}


int main() {

  static constexpr auto fat_graph     = buildGraph();
  static constexpr auto meta    = getMetadata(fat_graph);
  // Certain performance optimisations at compile time to the graph
  static constexpr auto optimised_graph = manifold::optimise(fat_graph);
  static constexpr auto compact = manifold::compact<meta>(optimised_graph);

  scions::cpu::CpuMemStore<meta> mem_store(compact);
  mem_store.initializeMemory();

  scions::cpu::exec_cpu_graph<compact>(mem_store);

  // ...
}
  • Build a computational graph/tree from tensors and ops. This step should be fully constexpr compatible; that is the main focus.
  • Allocate the Execution Provider (EP) memory.
  • Generate the code to be executed using templates, i.e. cpu::exec_cpu_graph. This function recursively instantiates templates over the graph to generate the calls to all the ops.

There are quite a few problems with this approach that I am hoping to get advice on. I will list the 2 main ones:

  • Array everything problem: Any value computed at compile time, like the StaticGraph, must have a predetermined size. This is causing a lot of problems with my approach. Currently I create a pointer-based tree, where each Expression holds pointers to its inputs and outputs. The tree is then flattened into a StaticGraph, just as it would be at runtime, but inside consteval. So I don't know any size information up front, which forces me to use predetermined macros to set the sizes of the input/output arrays of each Expression. I can reduce the sizes somewhat using metadata (clipping to the maximum length actually used) after the fat graph has been generated at compile time, but this seems far too hacky.

For context, this is the FlatExpression that is stored in an array in StaticGraph (which also has a hard size limit for the fat StaticGraph):

struct FlatExpression {
  OpType type;
  std::uint64_t id;
  std::uint32_t num_inputs;   // number of used entries in input_indices
  std::uint32_t num_outputs;  // number of used entries in output_indices
  std::array<std::uint32_t, MANIFOLD_MAX_EXP_INPUT> input_indices;
  std::array<std::uint32_t, MANIFOLD_MAX_EXP_OUTPUT> output_indices;
  std::size_t hash;
};
  • Op parameter passing: This is quite a challenge because, although I know the parameters at compile time, every op has different parameter types. One method would be to store the parameter structs as bytes in an array (which, as described in the problem above, would need a predetermined maximum size) and pass them to the different EP ops after casting to the appropriate parameter struct type based on the OpType.
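
As a hedged alternative to byte buffers and casts, the parameters could stay typed by keeping them in a std::tuple and letting overload resolution choose the op implementation. The sketch below assumes a simple linear chain of ops; ScaleParams, ClampParams, apply_op and run_chain are hypothetical names, not part of the library.

```cpp
#include <tuple>

// Hypothetical per-op parameter structs, each with its own fields.
struct ScaleParams { float factor; };
struct ClampParams { float lo; float hi; };

// One overload per parameter type; the compiler picks it from the static type,
// so no OpType switch and no reinterpret_cast are involved.
constexpr float apply_op(const ScaleParams& p, float x) { return x * p.factor; }

constexpr float apply_op(const ClampParams& p, float x) {
  return x < p.lo ? p.lo : (x > p.hi ? p.hi : x);
}

// Runs every op in tuple order, feeding each result into the next op.
template <typename... Params>
constexpr float run_chain(const std::tuple<Params...>& ops, float x) {
  std::apply([&x](const auto&... p) { ((x = apply_op(p, x)), ...); }, ops);
  return x;
}

// The tuple type records every parameter type, so no maximum size is needed.
inline constexpr auto op_params =
    std::tuple{ScaleParams{2.0f}, ClampParams{0.0f, 5.0f}};

static_assert(run_chain(op_params, 4.0f) == 5.0f);  // 4 * 2 = 8, clamped to 5
```

The trade-off is that the tuple type changes with every graph, so the parameters have to travel next to the flat graph (for example, indexed by node position) instead of inside a uniform FlatExpression array.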

This screams design problem to me. I feel like there must be a better way to do this than the array max-clip-cast procedure. Is there a way to implement a similar API while keeping graph construction intact at constexpr time? I don't know if this is enough information, but either way, thanks for the help.

PS: In the code above, arr1, arr2, arr3, etc. carry the size information, but I can't really propagate it because the ExpressionReflection object returned by every op must be able to accept other ExpressionReflection objects, which has led the struct

struct ExpressionReflection {
  // Expression part

  // OpType of the expression
  OpType type;
  uint32_t num_inputs;
  uint32_t num_outputs;
  std::unique_ptr<std::array<ExpressionReflection, MANIFOLD_MAX_EXP_REF_INPUT>> inputs;
  std::unique_ptr<std::array<ExpressionReflection, MANIFOLD_MAX_EXP_REF_INPUT>> outputs;

  // TensorReflection part
  std::optional<TensorReflection> tensor;
// ...

to be self-recursive and hence ill-defined. I feel that if there were an alternative way to express this recursive structure in a constexpr-compatible manner, it would solve most of my problems. Thanks for taking the time to read and reply.
