I have a question regarding the way to partition the spaces in kd-tree algorithm.
Assuming I have points in the plane, with (x,y) coordinate. Assuming we're not in a particular situation when points are in the same line. I was thinking why we need to alternate the splitting coordinate, at one level, use x axis, the following level, use y axis. What matters if we use only x direction to split spaces, we always have a binary tree, and search algorithm always take log(n) in average (assuming we have relatively well balanced tree).
What give me more when I split space by alternating splitting directions? I wonder if it's related to some general probabilistic properties in multi-dimension?
I think the problem comes when you start searching the tree, for example with a window query (rectangular query).
Lets assume a rectangular dataset with evenly distributed points between
-1,000
and1,000
in every direction. If you sort byx
, then a query that should return all point with(-900 < x < 900) && (1 < y < 10)
may have to scan almost the whole tree. Thelog(n)
search would only work if your query only limitsx
and noty
, i.e.(1<x<10) && (-inf<y<+inf)
.