I created a bounding volume hierarchy that is generated every frame. Due to it's use, each node must have two children, no more, no less.
Traversal is the single most expensive computation for my program right now and it prevents large scenes (>2k triangles) from running at acceptable frame rates. I am at a loss as to how it can be performed faster. A simple square with 16 triangles introduces a noticeable frame drop when many rays pass through it at once.
To traverse it I used the concepts provided by the paper at this link, where each thread traverses the following code.
What can be done to make it increase it's speed?
At the moment, this function is used in a much larger kernal. Will isolating it within its own dispatch call help at the cost of using more memory to store its output?
//o= origin , d =direction, bestT = the intersection distance
bool FindBVHIntersect(vec3 o, vec3 d,inout float bestT){
//starting at the root of the bvh(0) the locations of the two children are found (near , far)
uint current=0;
uint last=0;
uint near=uint(bvh[current].w);
uint far=uint(bvh[current+1].w);
//max distance to test intersection
float distance=4294967295.0f;
bool success=false;
bool run= true;
while(run){
//update the children's location
near=uint(bvh[current].w);
far=uint(bvh[current+1].w);
//traversing up from the last child
//ending
if(last==far&¤t==0){
run=false;
continue;
}
//go to the parent of the node
else if(last==far){
last=current;
current=bvhAtomics[current/2].y;
continue;
}
//depending on the last position, pick a child
uint tryChild= (last==bvhAtomics[current/2].y)?near:far;
bool delve;
//test the AABB
if(tryChild==near){
delve=FindAABBIntersect(bvh[current].xyz,bvh[current+1].xyz,o,d,0.0f,distance);
}
else{
delve=true;
}
//if the child is a node and needs to be delved. A triangle intersection test.
if(delve&&tryChild>=(nodeSize-1)*2){
float pt;
float ob1;
float ob2;
uint bvhPos=uint(bvh[tryChild+2].x);
uvec3 indPos=indices[bvhPos].xyz;
bool tr = FindTriangleIntersect(vertices[indPos.x].xyz, vertices[indPos.y].xyz, vertices[indPos.z].xyz, o, d, pt, ob1, ob2 );
if(tr){
distance=pt;
float t = pt;
if (t > 0 && t < bestT) {
bestT = t;
success=true;
}
}
last=tryChild;
}
//switching children or setting up for climbing up the tree
else if(delve){
last =current;
current=tryChild;
}
else{
last=far;
}
}
return success;
}
I tried to minimize the memory accesses within the context of the algorithm, however the divergence, which I assume to be the problem, seems an impossible hurdle to overcome.