I’m working on a task related to graph traversal (the Viterbi algorithm).
At each time step I have a compacted set of active states; some work is done in each state, and then the results are propagated through its outgoing arcs to each arc’s destination state, which is how the new set of active states is built.
The problem is that the number of outgoing arcs varies heavily, from two or three to several thousand, so the compute threads are loaded very unevenly.
I tried to share the work through a queue in local shared memory:
int tx = threadIdx.x;
int gtx = blockIdx.x * blockDim.x + tx; //global thread index into the compacted active set
extern __shared__ int smem[];
int *stateSet_s = smem; //new active set
int *arcSet_s = &(smem[Q_LEN]); //local shared queue of arcs to propagate
float *scores_s = (float*)&(smem[2*Q_LEN]); //source scores for the queued arcs
__shared__ int arcCnt;
__shared__ int stateCnt;
if ( tx == 0 )
{
arcCnt = 0;
stateCnt = 0;
}
__syncthreads();
//load state index from compacted list of state indexes
int stateId = activeSetIn_g[gtx];
float srcCost = scores_g[ stateId ];
int startId = outputArcStartIds_g[stateId];
int nArcs = outputArcCounts_g[stateId]; //number of outgoing arcs to be propagated (2-3 to thousands)
/////////////////////////////////////////////
/// prepare arc set
/// !!!! I think this is the problem code !!!!
/// bank conflicts? uncoalesced access?
int myPos = atomicAdd ( &arcCnt, nArcs ); //reserve nArcs slots in the shared queue
while ( ( nArcs > 0 ) && ( myPos < Q_LEN ) )
{
scores_s[myPos] = srcCost;
arcSet_s[myPos] = startId + nArcs - 1;
myPos++;
nArcs--;
}
__syncthreads();
//////////////////////////////////////
/// parallel propagate arc set
if ( arcSet_s[tx] > 0 )
{
FstArc arc = arcs_g[ arcSet_s[tx] ];
float srcCost_ = scores_s[tx];
DoSomeJob ( &srcCost_ );
int *dst = &(transitionData_g[arc.dst]);
int old = atomicMax( dst, FloatToInt ( srcCost_ ) );
////////////////////////////////
//// new active set
if ( old == ILZERO ) //destination state becomes active for the first time
{
int pos = atomicAdd ( &stateCnt, 1 );
stateSet_s[ pos ] = arc.dst;
}
}
/////////////////////////////////////////////
/// transfer new active set from smem to gmem
__syncthreads();
__shared__ int gPos;
if ( tx == 0 )
{
gPos = atomicAdd ( activeSetOutSz_g, stateCnt );
}
__syncthreads();
if ( tx < stateCnt )
{
activeSetOut_g[gPos + tx] = stateSet_s[tx];
}
__syncthreads();
But it runs very slowly, slower than when no active set is used at all (active set = all states), even though the active set is only 10–15 percent of all states. Register pressure went up heavily and occupancy is low, but I don’t think anything can be done about that.
Maybe there are more effective ways of sharing the work among threads? I thought about warp-shuffle ops on 3.0, but I have to use 2.x devices.
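For reference, the kind of warp-level sharing I have in mind would look roughly like this (just a sketch; __shfl needs compute capability 3.0, and it assumes the block size is a multiple of 32):

//whole warp cooperates on each lane's arc list in turn (needs cc 3.0 __shfl)
int lane = tx & 31;
for ( int src = 0; src < 32; src++ )
{
    int start = __shfl ( startId, src ); //broadcast lane src's arc range and score
    int cnt = __shfl ( nArcs, src );
    float cost = __shfl ( srcCost, src );
    for ( int a = lane; a < cnt; a += 32 )
    {
        //propagate arcs_g[start + a] with cost, as in the code above
    }
}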
Problems with uneven workload and dynamic work creation are usually addressed with multiple CUDA kernel calls, driven by a CPU loop like the one sketched below.
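A minimal sketch of such a driver loop (the hasWork flag, the totalWorkCnt_g counter, the kernel arguments and the launch configuration are assumptions for illustration):

int hasWork = 1;
while ( hasWork )
{
    //process as much work as possible; spill unfinished work to global memory
    doYourComputationKernel<<<numBlocks, blockSize>>>( /* work queues, counters, idle-block counter */ );
    //redistribute the spilled work evenly across blocks
    loadBalanceKernel<<<numBlocks, blockSize>>>( /* saved work, per-block counters */ );
    //read back the total amount of remaining work; zero means we are done
    cudaMemcpy( &hasWork, totalWorkCnt_g, sizeof(int), cudaMemcpyDeviceToHost );
}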
doYourComputationKernel() must have a heuristic to know when it is a good time to stop and hand control back to the CPU so the workload can be rebalanced. This can be done with a global counter of idle blocks: the counter is incremented every time a block finishes its work or cannot create more work, and when the number of idle blocks exceeds a threshold, all blocks save their remaining work to global memory and finish.
loadBalanceKernel() should receive the global array with all the saved work and another global array of per-block work counters. A reduction over the latter gives the total number of work items, and from that the number of items per block. Finally, the kernel should copy the work so that every block receives the same number of elements.

The loop continues until all computation is done. There's a good paper about this: gProximity: Hierarchical GPU-based Operations for Collision and Distance Queries. The idea there is to balance the load of continuous collision detection, which is very uneven.
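A minimal sketch of such a redistribution kernel (the names are illustrative, and it assumes the saved work has already been compacted into one contiguous array and its total counted by the reduction):

//workIn_g  - all saved work items, compacted into one contiguous array
//total     - total number of saved items (result of the reduction over the per-block counters)
//workOut_g - per-block output queues of maxPerBlock entries each
__global__ void loadBalanceKernel ( const int *workIn_g, int total,
                                    int *workOut_g, int maxPerBlock )
{
    int perBlock = ( total + gridDim.x - 1 ) / gridDim.x; //equal share per block
    int begin = blockIdx.x * perBlock;
    int end = min ( begin + perBlock, total );
    //threads of the block cooperatively copy this block's slice into its own queue
    for ( int i = begin + threadIdx.x; i < end; i += blockDim.x )
    {
        workOut_g[ blockIdx.x * maxPerBlock + ( i - begin ) ] = workIn_g[i];
    }
}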