Use of Hilvert Curve to query a rectangular area and see if it overlaps other rectangles

504 Views Asked by At

I am looking for a method that can help me in project I am working on. The idea is that there are some rectangles in 2d space, I want to query a rectangular area and see if it overlaps any rectangle in that area. If it does, it fails. If it doesn't, meaning the space is empty, it succeeds.

I was linked to z-order curves to help turn 2d coordinates into 1d. While I was reading about it, I encountered the Hilbert curve. I read that the Hilbert curve is preferred over a z-order curve because it maintains better proximity of points. I also read that the Hilbert curve is used to make more efficient quadtrees and octrees.

I was reading this article https://sigmodrecord.org/publications/sigmodRecord/0103/3.lawder.pdf for a possible solution but I don't know if this applies to my case.

I also saw this comment https://news.ycombinator.com/item?id=14217760 which mentioned multiple index entries for non point objects.

Is there an elegant method where I can use the Hilbert curve to achieve this? Is it possible with just an array of rectangles?

1

There are 1 best solutions below

4
On

I am pretty sure it is possible and can be very efficient. But there is a lot of complexity involved.

I have implemented this for a z-order curve and called it PH-Tree, you find implementations in Java and C++, as well as theoretical background in the link. The PH-Tree is similar to a z-ordered quadtree but with several modifications that allow taking full advantage of the space filling curve.

There are a few things to unpack here.

Hilbert curves vs z-order curves: Yes, Hilbert curves have a slightly better proximity than z-order curves. Better proximity does two things: 1) reduce the number of elements (nodes, data) to look at and 2) improve linear access patterns (less hopping around). Both is important if the spatial index is stored on disk and I/O is expensive (especially old disk drives).

If I remember correctly, Hilbert curve and z-order are similar enough that the always access the same amount of data. The only thing that Hilbert curves are better at is linear access. However, Hilbert curves are much more complicated to calculate, in the tests I made with an in-memory index (not very thorough testing, I admit) I found that z-order curves are considerably more efficient because the reduced CPU time outweighs the cost of accessing data slightly out of order.

Space filling curves (at least Hilbert and z-curve) allow for some neat bit-level hacks that can speed up index operations. However, even for z-ordering I found that getting these right required a lot of thinking (I wrote a whole paper about it). I believe these operations can be adapted for Hilbert curves but I may take some time and, as I wrote above, it may not improve performance at all.

Storing rectangles in a spatial curve index: There are different approaches to encode rectangles in a spatial curve. All approaches that I am aware of encode the rectangle in a multi-dimensional point. I found the following encoding to work best (assuming axis aligned rectangles). We defined the rectangle by the lower left minimum-corner and the upper right maximum corner, e.g. min={min0, min1}={2,3}/max={max0, max1}={8,4}. We transform this (by interleaving the min/max values) into a 4-dimensional point {2,8,3,4} and store this point in the index. Other approaches use a different ordering (2,3,8,4) or, instead of two corners, store the center point and the lengths of the edges.

Querying: If you want to find any rectangles that overlap/intersect with a given region (we call that a window query) we need to create a 4-dimensional query box, i.e. an axis aligned box that is defined by a 4D min-point and 4D max-point (copied from here):

min = {−∞, min_0, −∞, min_1} = {−∞, 2, −∞, 3}

max = {max_0, +∞, max_1, +∞} = {8, +∞, 4, +∞}

We can process the dimensions one by one. The first min/max pair is {−∞, 8}, that will match any 2D rectangle whose minimum x-coordinate is is 8 or lower. All coordinates:

  • d=0: min/max pair is {−∞, 8}: matches any 2D min-x <= 8
  • d=1: min/max pair is {2, +∞}: matches any 2D max-x >= 2
  • d=2: min/max pair is {−∞, 4}: matches any 2D min-y <= 4
  • d=3: min/max pair is {3, +∞}: matches any 2D max-y <= 3

If all these conditions hold true, then the stored rectangle overlaps with the query window.

Final words: This sounds complicated but can be implemented very efficiently (also lends itself to vectorization). I found that is on par with other indexes (quadtree, R-Star-Tree), see some benchmarks I made here. Specifically, I found that the z-ordered indexes have very good insertion/update/removal times (I don't know whether that matters for you) and is very good for small query result sizes (it sounds like you often expect it be zero, i.e. no overlapping rectangle found). It generally works well with large datasets (1000s or millions of entries) and datasets that have strong clusters. For smaller datasets of if you expect to typically find many result (you can of course abort a query early once you find the first match!) other index types may be better.

On a last note, I found the dataset characteristics to have considerable influence on which index worked best. Also, implementation appears to be at least as important as the underlying algorithms. Especially for quadtrees I found huge variations in performance from different implementations.