The problem I'm facing is segmenting large datasets (up to 2048x2048x40x10000 in x,y,z,t, i.e. a few terabytes decompressed, give or take). On the plus side, the features in this dataset are rather small: 20x20x20x20 max or so.
As far as I can see, there are no out-of-the-box solutions for this (correct me if I'm wrong). I have some plans for how to tackle it, but I would like your feedback.
One timeslice is ~600 MB max, and less under typical circumstances, so I can hold a bunch of consecutive slices in my 4 GB of memory.
Given the small size of my features, my intuition says it is best to eschew all forms of segmentation cleverness and simply do a local, iterative, floodfill-like update over my labels: if your neighbor has a higher label, copy it; iterate until convergence. The number of iterations should be bounded by the maximum cluster size in any dimension, which, again, should be small.
CUDA has a natural preference for 3D, so I could do this as a two-step process: iterate over all 3D volume slices that have not yet converged, then simply do an elementwise loop over all consecutive time slices and perform the same floodfill logic.
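To make that concrete, here is a rough, untested sketch of what I mean by one propagation pass over a single 3D volume. The layout is just my assumption (one int label per voxel, 0 = background, x varying fastest, Jacobi-style update into a second buffer), and the host would relaunch until the changed flag stays 0:

// One label-propagation pass over a W x H x D volume of int labels.
// Assumptions: 0 = background, x varies fastest, labels_in != labels_out.
__global__ void propagate_labels_3d(const int* labels_in, int* labels_out,
                                    int* changed, int W, int H, int D)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;
    if (x >= W || y >= H || z >= D) return;

    int idx  = x + y * W + z * W * H;
    int best = labels_in[idx];
    if (best == 0) { labels_out[idx] = 0; return; }      // background never changes

    // copy the highest label among the 6-connected neighbors
    if (x > 0     && labels_in[idx - 1]     > best) best = labels_in[idx - 1];
    if (x < W - 1 && labels_in[idx + 1]     > best) best = labels_in[idx + 1];
    if (y > 0     && labels_in[idx - W]     > best) best = labels_in[idx - W];
    if (y < H - 1 && labels_in[idx + W]     > best) best = labels_in[idx + W];
    if (z > 0     && labels_in[idx - W * H] > best) best = labels_in[idx - W * H];
    if (z < D - 1 && labels_in[idx + W * H] > best) best = labels_in[idx + W * H];

    labels_out[idx] = best;
    if (best != labels_in[idx]) *changed = 1;             // host loops until this stays 0
}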
I could initialize the iteration with a simple incrementing unique counter, or find local maxima first and seed labels there. The latter is preferable, so I can keep an array indexed by label that stores the minimum/maximum x,y,z,t extent of each region (I could do that as postprocessing too). If a region does not extend to the latest timeslice, it is erased from the data and its location written to a database. If the trailing timeslice has been exhausted fully that way (sums to zero), drop it from memory (or, if memory overflows, drop the latest as well; the approximation thus made will have to be put up with).
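For the simple-counter variant, the seeding could be as trivial as the following; the float data type and the plain intensity threshold are placeholders for whatever criterion actually defines foreground in my data:

// Simple-counter seeding: every foreground voxel starts with a unique label
// derived from its flat index within the chunk being processed.
__global__ void init_labels(const float* data, int* labels, float threshold, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    labels[i] = (data[i] > threshold) ? i + 1 : 0;        // 0 is reserved for background
}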
It seems like that should work. Considering the limited size of the z-dimension, do you think it is better to launch x,y,z thread blocks, or to launch x,y blocks and let each thread loop over the z-dimension? Is that a 'try and see' kind of thing, or is there a stock answer to that?
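For comparison, the loop-over-z variant I have in mind would look roughly like this (again untested, same layout assumptions as above); it at least gets to keep the z-neighbors in registers between loop iterations, at the cost of the convergence flag being left out here for brevity:

__global__ void propagate_labels_zloop(const int* in, int* out, int W, int H, int D)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= W || y >= H) return;

    int below = 0;                              // label at z-1, kept in a register
    int cur   = in[x + y * W];                  // label at z
    for (int z = 0; z < D; ++z) {
        int idx   = x + y * W + z * W * H;
        int above = (z + 1 < D) ? in[idx + W * H] : 0;
        int best  = cur;
        if (best != 0) {
            if (x > 0     && in[idx - 1] > best) best = in[idx - 1];
            if (x < W - 1 && in[idx + 1] > best) best = in[idx + 1];
            if (y > 0     && in[idx - W] > best) best = in[idx - W];
            if (y < H - 1 && in[idx + W] > best) best = in[idx + W];
            if (below > best) best = below;
            if (above > best) best = above;
        }
        out[idx] = best;
        below = cur;                            // slide the register window up one z
        cur   = above;
    }
}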
Another optimization just occurred to me: if I load a block of x,y,z into shared memory, would it not be faster to perform several floodfill updates while I've got the memory there anyway? Perhaps it's even best to let the shared-memory tile iterate to convergence, then move on... it is related to the above issue, I suppose. A single neighbor-maximum lookup probably has suboptimal compute intensity, so either looping over z or iterating several times should offset that. I think I like the latter better.
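Roughly, the shared-memory idea would be something like the sketch below (untested; the 8x8x8 tile size and the fixed number of in-tile passes are guesses to be tuned). The halo stays fixed during the in-tile passes, so the outer host-side iteration is still what guarantees global convergence:

#define TX 8
#define TY 8
#define TZ 8
#define LOCAL_ITERS 4   // number of in-tile passes per launch (a guess; tune)

__global__ void propagate_labels_tiled(const int* in, int* out, int* changed,
                                       int W, int H, int D)
{
    // tile plus a one-voxel halo on every side
    __shared__ int tile[TZ + 2][TY + 2][TX + 2];

    int gx = blockIdx.x * TX + threadIdx.x;
    int gy = blockIdx.y * TY + threadIdx.y;
    int gz = blockIdx.z * TZ + threadIdx.z;
    int lx = threadIdx.x + 1, ly = threadIdx.y + 1, lz = threadIdx.z + 1;

    // cooperative load of the (TX+2)*(TY+2)*(TZ+2) tile; out-of-volume voxels become 0
    int tid = threadIdx.x + threadIdx.y * TX + threadIdx.z * TX * TY;
    for (int s = tid; s < (TX + 2) * (TY + 2) * (TZ + 2); s += TX * TY * TZ) {
        int sx = s % (TX + 2);
        int sy = (s / (TX + 2)) % (TY + 2);
        int sz = s / ((TX + 2) * (TY + 2));
        int vx = blockIdx.x * TX + sx - 1;
        int vy = blockIdx.y * TY + sy - 1;
        int vz = blockIdx.z * TZ + sz - 1;
        tile[sz][sy][sx] = (vx >= 0 && vx < W && vy >= 0 && vy < H && vz >= 0 && vz < D)
                         ? in[vx + vy * W + vz * W * H] : 0;
    }
    __syncthreads();

    // several Jacobi-style passes entirely inside shared memory
    for (int it = 0; it < LOCAL_ITERS; ++it) {
        int best = tile[lz][ly][lx];
        if (best != 0) {
            best = max(best, tile[lz][ly][lx - 1]);
            best = max(best, tile[lz][ly][lx + 1]);
            best = max(best, tile[lz][ly - 1][lx]);
            best = max(best, tile[lz][ly + 1][lx]);
            best = max(best, tile[lz - 1][ly][lx]);
            best = max(best, tile[lz + 1][ly][lx]);
        }
        __syncthreads();
        if (best != tile[lz][ly][lx]) {
            tile[lz][ly][lx] = best;
            *changed = 1;
        }
        __syncthreads();
    }

    if (gx < W && gy < H && gz < D)
        out[gx + gy * W + gz * W * H] = tile[lz][ly][lx];
}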
Another question: it doesn't seem that anything like this exists already, but links to projects containing template code that does similar things (optimized 3D floodfill code?) would be highly appreciated, as my CUDA knowledge is still spotty.
Thanks in advance for your thoughts and feedback!
The simplest way is to use 1D storage and overlay a 4D index schema on top of it.
address = x + y * width + z * height * width + t * length * height * width;
This would mean that
data( 0, 0, 0, 0 ) and data( 1, 0, 0, 0 ) are at consecutive addresses, but
data( 0, 0, 0, 0 ) and data( 0, 0, 0, 1 ) are width * height * length addresses apart.
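As a small illustration of that scheme (names are placeholders; note that at 2048x2048x40x10000 voxels the flat index exceeds 32 bits, so it has to be computed in 64 bits):

// Flat-index helper for the layout above; length is the z extent.
__host__ __device__ inline size_t flat_index(int x, int y, int z, int t,
                                             int width, int height, int length)
{
    return (size_t)x
         + (size_t)y * width
         + (size_t)z * width * height
         + (size_t)t * width * height * length;
}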
But if your access pattern is processing nearest neighbors, then you need spatial locality in your indexing.
This can be achieved using Morton Z-order (Peano key) indexing, which places nearest neighbors at nearby linear memory addresses. The Z-order linear address is obtained by interleaving the bits of the x,y,z,t indices. For a 2D example see
http://graphics.stanford.edu/~seander/bithacks.html#InterleaveTableObvious
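For instance, the straightforward (clear but not fast) 2D version looks like this; the page above lists faster table- and magic-number-based variants, and the 4D case just interleaves four coordinates instead of two:

// Interleave the bits of x (even bit positions) and y (odd bit positions)
// to obtain the 2D Z-order / Morton index.
__host__ __device__ inline unsigned int morton2d(unsigned short x, unsigned short y)
{
    unsigned int z = 0;
    for (int i = 0; i < 16; ++i) {
        z |= ((unsigned int)(x >> i) & 1u) << (2 * i);
        z |= ((unsigned int)(y >> i) & 1u) << (2 * i + 1);
    }
    return z;
}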