Segmentation of 4D data on GPU


The problem I'm facing is segmenting large datasets (up to 2048x2048x40x10000 in x,y,z,t == a few terabytes decompressed, give or take). On the plus side, the features in this dataset are rather small: 20x20x20x20 max or so.

As far as I can see, there are no out-of-the-box solutions for this (correct me if I'm wrong). I have some plans for how to tackle it, but I would like your feedback.

One time slice is ~600 MB max, and less under typical circumstances, so I can hold a bunch of consecutive such slices in my 4 GB of memory.

Given the small size of my features, my intuition says it is best to eschew all forms of segmentation cleverness and simply do a local, iterative, floodfill-like update over my labels: if your neighbor has a higher label, copy it; iterate until convergence. The number of iterations should be bounded by the maximum cluster size in any dimension, which, again, should be small.
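What I have in mind is something like this sweep (a minimal sketch with illustrative names; it assumes background voxels carry label 0, a row-major layout with x fastest, and a host loop that re-launches the kernel until the changed flag stays zero):

```cuda
// One neighbor-max sweep over a 3D volume (sketch; names are illustrative).
__global__ void propagate_labels(const unsigned char* mask,
                                 unsigned int* labels,
                                 unsigned int* changed,
                                 int nx, int ny, int nz)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;
    if (x >= nx || y >= ny || z >= nz) return;

    size_t i = (size_t)x + (size_t)nx * (y + (size_t)ny * z);
    if (!mask[i]) return;                      // background stays unlabeled

    unsigned int best = labels[i];
    size_t sz = (size_t)nx * ny;               // stride of one z-step
    if (x > 0      && labels[i - 1]  > best) best = labels[i - 1];
    if (x < nx - 1 && labels[i + 1]  > best) best = labels[i + 1];
    if (y > 0      && labels[i - nx] > best) best = labels[i - nx];
    if (y < ny - 1 && labels[i + nx] > best) best = labels[i + nx];
    if (z > 0      && labels[i - sz] > best) best = labels[i - sz];
    if (z < nz - 1 && labels[i + sz] > best) best = labels[i + sz];

    if (best > labels[i]) {
        labels[i] = best;
        *changed = 1;                          // host re-launches while set
    }
}
```

Since labels only ever grow, the unsynchronized in-place update stays monotone; racy reads at worst delay propagation by one sweep.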

CUDA has a natural preference for 3D, so I could do this as a two-step process: first iterate over all 3D volume slices that have not yet converged, then simply do an elementwise loop over all consecutive time slices and perform the same floodfill logic along t.

I could initialize the iteration with a simple incrementing unique counter, or find local maxima first and seed labels there. The latter is to be preferred, so I can keep an array indexed by label that stores the minimum/maximum x,y,z,t extent of each region (this could also be done as postprocessing). If a region does not extend into the latest time slice, it is erased from the data and its location written to a database. If the trailing time slice has been fully exhausted that way (sums to zero), it is dropped from memory (or, if memory overflows, the latest slice is dropped as well; the approximation thus made will have to be put up with).
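For the seeding step, I picture something like this (again only a sketch with made-up names; the counter starts at zero, and plateaus/ties would need extra care in real data):

```cuda
// Seed a fresh label at every (6-connected, strict) local maximum, using a
// global atomic counter so that labels are unique.
__global__ void seed_maxima(const float* data, unsigned int* labels,
                            unsigned int* counter,
                            int nx, int ny, int nz)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;
    if (x >= nx || y >= ny || z >= nz) return;

    size_t i = (size_t)x + (size_t)nx * (y + (size_t)ny * z);
    size_t sz = (size_t)nx * ny;
    float v = data[i];

    bool is_max =
        (x == 0      || data[i - 1]  < v) &&
        (x == nx - 1 || data[i + 1]  < v) &&
        (y == 0      || data[i - nx] < v) &&
        (y == ny - 1 || data[i + nx] < v) &&
        (z == 0      || data[i - sz] < v) &&
        (z == nz - 1 || data[i + sz] < v);

    // Label 0 is reserved for "no label"; maxima get 1, 2, 3, ...
    labels[i] = is_max ? 1 + atomicAdd(counter, 1) : 0;
}
```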

It seems like that should work. Considering the limited size of the z-dimension, do you think it is better to launch x,y,z thread blocks, or to launch x,y blocks and let each thread loop over the z-dimension? Is that a 'try and see' kind of thing, or is there a stock answer to that?
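The z-loop variant would look roughly like this (a sketch under the same assumptions as above; which layout wins presumably depends on the hardware and is worth benchmarking). One side benefit of the loop is that the z-1 label can be carried in a register between iterations:

```cuda
// Same sweep, but with x,y blocks and a per-thread walk along the short
// z-axis (sketch; carrying the z-1 label in a register saves one global
// read per voxel).
__global__ void propagate_labels_zloop(const unsigned char* mask,
                                       unsigned int* labels,
                                       unsigned int* changed,
                                       int nx, int ny, int nz)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= nx || y >= ny) return;

    size_t sz = (size_t)nx * ny;               // stride of one z-step
    size_t i = (size_t)x + (size_t)nx * y;     // voxel at z == 0
    unsigned int below = 0;                    // label at z-1; 0 at the edge

    for (int z = 0; z < nz; ++z, i += sz) {
        unsigned int cur = labels[i];
        if (mask[i]) {
            unsigned int best = cur;
            if (x > 0      && labels[i - 1]  > best) best = labels[i - 1];
            if (x < nx - 1 && labels[i + 1]  > best) best = labels[i + 1];
            if (y > 0      && labels[i - nx] > best) best = labels[i - nx];
            if (y < ny - 1 && labels[i + nx] > best) best = labels[i + nx];
            if (below > best) best = below;
            if (z < nz - 1 && labels[i + sz] > best) best = labels[i + sz];
            if (best > cur) {
                labels[i] = best;
                *changed = 1;
                cur = best;
            }
        }
        below = cur;                           // register reuse for next z
    }
}
```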

Another optimization just occurred to me: if I load a block of x,y,z into shared memory, would it not be faster to perform several floodfill updates while I've got the memory there anyway? Perhaps it is even best to let the local memory iterate to convergence, then move on... it is related to the above issue, I suppose. A single neighbor-maximum lookup probably has suboptimal compute intensity, so either looping over z or iterating several times should offset that. I think I like the latter better.
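Roughly what I imagine for the shared-memory version (a sketch; the 8x8x8 block with a 6x6x6 interior, the halo handling, and the four local sweeps are all illustrative choices, and global convergence still requires repeated launches because the halos go stale after the first local sweep). It would be launched with blockDim = (8,8,8) and a grid of ceil(n/6) blocks per axis:

```cuda
// Each 8x8x8 thread block loads a 6x6x6 interior tile plus a one-voxel
// halo into shared memory, runs a few update sweeps locally, and writes
// back once, amortizing the global-memory traffic.
__global__ void propagate_labels_tiled(const unsigned char* mask,
                                       unsigned int* labels,
                                       unsigned int* changed,
                                       int nx, int ny, int nz)
{
    __shared__ unsigned int tile[8][8][8];

    // Interiors of adjacent blocks abut; the one-voxel halos overlap.
    int x = blockIdx.x * 6 + threadIdx.x - 1;
    int y = blockIdx.y * 6 + threadIdx.y - 1;
    int z = blockIdx.z * 6 + threadIdx.z - 1;

    bool in_vol = x >= 0 && x < nx && y >= 0 && y < ny && z >= 0 && z < nz;
    size_t i = in_vol ? (size_t)x + (size_t)nx * (y + (size_t)ny * z) : 0;
    bool fg = in_vol && mask[i];              // foreground voxels only

    unsigned int start = fg ? labels[i] : 0;
    tile[threadIdx.z][threadIdx.y][threadIdx.x] = start;
    __syncthreads();

    bool interior = threadIdx.x >= 1 && threadIdx.x <= 6 &&
                    threadIdx.y >= 1 && threadIdx.y <= 6 &&
                    threadIdx.z >= 1 && threadIdx.z <= 6;

    for (int it = 0; it < 4; ++it) {          // a few local sweeps; tune
        unsigned int best = tile[threadIdx.z][threadIdx.y][threadIdx.x];
        if (interior && fg) {
            best = max(best, tile[threadIdx.z][threadIdx.y][threadIdx.x - 1]);
            best = max(best, tile[threadIdx.z][threadIdx.y][threadIdx.x + 1]);
            best = max(best, tile[threadIdx.z][threadIdx.y - 1][threadIdx.x]);
            best = max(best, tile[threadIdx.z][threadIdx.y + 1][threadIdx.x]);
            best = max(best, tile[threadIdx.z - 1][threadIdx.y][threadIdx.x]);
            best = max(best, tile[threadIdx.z + 1][threadIdx.y][threadIdx.x]);
        }
        __syncthreads();
        tile[threadIdx.z][threadIdx.y][threadIdx.x] = best;
        __syncthreads();
    }

    unsigned int final_label = tile[threadIdx.z][threadIdx.y][threadIdx.x];
    if (interior && fg && final_label != start) {
        labels[i] = final_label;
        *changed = 1;
    }
}
```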

Another question: it doesn't seem that anything like this exists already, but links to projects containing template code that does similar things would be highly appreciated (optimized 3D floodfill code?), as my CUDA knowledge is still spotty.

Thanks in advance for your thoughts and feedback!

There are 3 answers below.

Answer 1

The simplest way is to use 1D storage and overlay a 4D indexing scheme on top of it.

address = x + y * width + z * height * width + t * length * height * width;

This means that

data(0, 0, 0, 0) and data(1, 0, 0, 0) are at consecutive addresses, but

data(0, 0, 0, 0) and data(0, 0, 0, 1) are width * height * length addresses apart.
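Wrapped in a helper, that addressing looks like this (a sketch with an illustrative name; note the 64-bit index, since 2048 x 2048 x 40 x 10000 is about 1.7e12 elements and overflows 32-bit addressing):

```cuda
// Flat 4D addressing as described above; width/height/length are the
// x/y/z extents. Usable from both host and device code.
__host__ __device__ inline size_t index4d(int x, int y, int z, int t,
                                          int width, int height, int length)
{
    return (size_t)x
         + (size_t)width * ((size_t)y
         + (size_t)height * ((size_t)z
         + (size_t)length * (size_t)t));
}
```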

But if your access pattern is processing nearest neighbors, then you need spatial locality in your indexing.

This can be achieved using Morton (Z-order, also known as Peano key) ordering of the indexes. It places nearest neighbors at close linear memory addresses. The Z-order linear address is obtained by interleaving the bits of the x,y,z,t indices. For a 2D example see

http://graphics.stanford.edu/~seander/bithacks.html#InterleaveTableObvious
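For illustration, the straightforward 2D interleave from that page can be written as follows (a sketch; a 4D Morton key would instead spread each index's bits to every fourth bit position):

```cuda
// Interleave the bits of x and y: x bits land on even positions, y bits on
// odd positions, yielding the 2D Morton (Z-order) key.
__host__ __device__ inline unsigned int morton2d(unsigned short x,
                                                 unsigned short y)
{
    unsigned int key = 0;
    for (int i = 0; i < 16; ++i) {
        key |= ((unsigned int)(x >> i) & 1u) << (2 * i);      // even bits
        key |= ((unsigned int)(y >> i) & 1u) << (2 * i + 1);  // odd bits
    }
    return key;
}
```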

Answer 2

For the record: I've got a working solution that I'm happy with; I might put it up online somewhere when it is somewhat more mature. Here is a description.

What I do right now is:

  • Find maxima.
  • Label the maxima incrementally using atomics.
  • Create a pointer image where each pixel points to itself.
  • Floodfill on the pointer image: if a neighbor points to a larger value, copy its pointer (see the sketch below).

This gives a watershedded pointer image, with each pointer pointing to a unique label.
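In sketch form, the floodfill step looks roughly like this (illustrative names, float data assumed, and not literally my code; the host re-launches until nothing changes):

```cuda
// ptr[i] starts out as i; each sweep adopts a neighbor's pointer whenever
// it points at a voxel with a larger data value. At convergence every
// pointer rests on one of the labeled maxima.
__global__ void pointer_floodfill(const float* data, size_t* ptr,
                                  unsigned int* changed,
                                  int nx, int ny, int nz)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;
    if (x >= nx || y >= ny || z >= nz) return;

    size_t i = (size_t)x + (size_t)nx * (y + (size_t)ny * z);
    long long sz = (long long)nx * ny;

    long long off[6] = { -1, 1, -nx, nx, -sz, sz };
    bool      ok[6]  = { x > 0, x < nx - 1, y > 0, y < ny - 1,
                         z > 0, z < nz - 1 };

    size_t best = ptr[i];
    for (int n = 0; n < 6; ++n) {
        if (!ok[n]) continue;
        size_t cand = ptr[(size_t)((long long)i + off[n])];
        if (data[cand] > data[best]) best = cand;
    }
    if (best != ptr[i]) { ptr[i] = best; *changed = 1; }
}
```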

Then I do a merging step: if neighboring pixels point to different labels, and their values are over a threshold, I merge their labels, so that maxima that are 'sufficiently connected' do not form their own regions.
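One way to realize that sweep in sketch form (not necessarily the exact scheme I use) is a parent table linked with atomics; chains then need a flattening pass afterwards, and the sweep may have to be repeated:

```cuda
// Adjacent above-threshold voxels carrying different labels link the
// larger label to the smaller one in a parent table (parent[l] == l
// initially). Checking only +x,+y,+z visits each voxel pair once.
__global__ void merge_labels(const float* data, const unsigned int* labels,
                             unsigned int* parent, float threshold,
                             int nx, int ny, int nz)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;
    if (x >= nx || y >= ny || z >= nz) return;

    size_t i = (size_t)x + (size_t)nx * (y + (size_t)ny * z);
    unsigned int a = labels[i];
    if (a == 0 || data[i] <= threshold) return;

    size_t sz = (size_t)nx * ny;
    size_t nbr[3] = { i + 1, i + nx, i + sz };
    bool   ok[3]  = { x < nx - 1, y < ny - 1, z < nz - 1 };

    for (int n = 0; n < 3; ++n) {
        if (!ok[n]) continue;
        unsigned int b = labels[nbr[n]];
        if (b == 0 || b == a || data[nbr[n]] <= threshold) continue;
        unsigned int lo = min(a, b), hi = max(a, b);
        atomicMin(&parent[hi], lo);   // link larger label toward smaller
    }
}
```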

Then I dereference the pointers, and I have a nice label image. Since I seeded labels only on the maxima, the maximum label is a small number: I allocate a 6xN array of that size, and use atomics to calculate the min/max extents of each feature.
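The extent pass is then little more than six atomics per foreground voxel (a sketch; the table layout here is a guess, and it must be pre-filled with INT_MAX/INT_MIN pairs):

```cuda
// Per label, update the min/max of x, y and z in a 6 x N table laid out as
// xmin,xmax,ymin,ymax,zmin,zmax per label.
__global__ void label_extents(const unsigned int* labels, int* ext,
                              int nx, int ny, int nz)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;
    if (x >= nx || y >= ny || z >= nz) return;

    unsigned int l = labels[(size_t)x + (size_t)nx * (y + (size_t)ny * z)];
    if (l == 0) return;                       // background

    int* e = ext + 6 * (size_t)(l - 1);       // labels start at 1
    atomicMin(&e[0], x); atomicMax(&e[1], x);
    atomicMin(&e[2], y); atomicMax(&e[3], y);
    atomicMin(&e[4], z); atomicMax(&e[5], z);
}
```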

Not using any shared memory at the moment, but it's already quite fast. I accidentally ended up doing something semi-clever: since I first do a watershed and then merge neighboring watershedded regions in a single sweep, I am effectively doing a multi-level algorithm of sorts. Performance is dominated by this handful of floodfill calls and two or three merging sweeps. Using shared memory correctly could reduce my memory bandwidth requirements by a factor of 5, but I'll wait and see until that explicitly manifests itself as a bottleneck. I am already getting a segmentation that is more like what I want than what I could wrangle from scipy; it is faster and uses an order of magnitude less memory, so I'm happy for now.

Answer 3

I would suggest a pyramidal processing scheme: you can quickly compute mipmap-like levels of detail that store the maximum/minimum value of the combined voxels. This results in something similar to an octree. You can then start your segmentation only on the interesting areas, which should give a considerable speedup. This should work pretty well if you have only a few, very small segments.
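A single level of such a max-pyramid could be computed along these lines (a sketch with illustrative names; a min-pyramid is symmetric, and any brick whose coarse maximum falls below the detection threshold can be skipped entirely during segmentation):

```cuda
#include <float.h>

// Each output voxel stores the max of a 2x2x2 block of the finer level;
// odd-sized borders are handled by clamping to the last fine voxel.
__global__ void downsample_max(const float* fine, float* coarse,
                               int cx, int cy, int cz,   // coarse extents
                               int fx, int fy, int fz)   // fine extents
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;
    if (x >= cx || y >= cy || z >= cz) return;

    float m = -FLT_MAX;
    for (int dz = 0; dz < 2; ++dz)
    for (int dy = 0; dy < 2; ++dy)
    for (int dx = 0; dx < 2; ++dx) {
        int sx = min(2 * x + dx, fx - 1);
        int sy = min(2 * y + dy, fy - 1);
        int sz = min(2 * z + dz, fz - 1);
        m = fmaxf(m, fine[(size_t)sx + (size_t)fx * (sy + (size_t)fy * sz)]);
    }
    coarse[(size_t)x + (size_t)cx * (y + (size_t)cy * z)] = m;
}
```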

I guess you could also use level sets (which can be implemented on the GPU), but I reckon these are a bit harder to implement (google for "CUDA level set").

Do you have to keep track of your segments over time? Do you have any idea how to do that?