Scalability of GeoHash Based proximity services

75 Views Asked by At

I was reading the Proximity Service chapter in System Design Volume 2 by Alex Xu, where one of the approaches for locating businesses near a specific location involved using "GeoHashes".

However, when using this approach to find nearby businesses, a LIKE query is necessary, such as

WHERE geohash LIKE '9qbwac%'

Despite having an index on the geohash column in Postgres, this query will trigger a full table scan. As the number of businesses grows, a full table scan could significantly slow down the process. Hence, is this approach truly better?

I tried building a dummy table in Postgres with index on the geoHash column and inserted 1 million random entries. Running a WHERE clause with LIKE operator does full table scan.

1

There are 1 best solutions below

0
On

My guess is you didn't set up the right index that would be able to support queries based on like or ^@, as already suggested by @jjanes. Here's an example demo at db<>fiddle, comparing operations based on geohash vs regular PostGIS geometries. For prefix checks, text_pattern_ops will be enough, but I went with pg_trgm to also compare <-> distance/similarity measurements (ignoring the fact that's not really the way to compute distance based on a geohash).

PostGIS:

prepare postgis_based2 as
select id,business_name,x,y,geohash
from business
where st_dwithin((select geom from business where id=42),geom,1)
limit 5;

explain analyze execute postgis_based2;
execute postgis_based2;
QUERY PLAN
Limit (cost=8.71..93.30 rows=5 width=62) (actual time=0.114..0.120 rows=5 loops=1)
  InitPlan 1 (returns $0)
    -> Index Scan using business_pkey on business business_1 (cost=0.29..8.31 rows=1 width=32) (actual time=0.007..0.007 rows=1 loops=1)
          Index Cond: (id = 42)
  -> Index Scan using geom_idx on business (cost=0.41..169.58 rows=10 width=62) (actual time=0.112..0.118 rows=5 loops=1)
        Index Cond: (geom && st_expand($0, '1'::double precision))
        Filter: st_dwithin($0, geom, '1'::double precision)
        Rows Removed by Filter: 1
Planning Time: 0.367 ms
Execution Time: 0.187 ms
id business_name x y geohash
22307 business_22307 165.213429396186 -45.3420063495878 pxvqp98xxvmudmj97vkf
83713 business_83713 164.855609921369 -45.0914942385985 pxvprrk5n0hr95qn5u8w
93720 business_93720 164.990241156698 -45.0564514554435 pxvrdmwppk8tgw7n39vb
42 business_42 164.28489159467 -45.2105222171467 pxuyf3ste7vf53zk17fj
6309 business_6309 164.818000468193 -45.2836125675001 pxvnqs2gcfb9hsmg5tru

Geohash:

prepare geohash_based2 as
select id,business_name,x,y,geohash
from business
where geohash like (select left(geohash,2)||'%' from business where id=42)
limit 5;

explain analyze execute geohash_based2;
execute geohash_based2;
QUERY PLAN
Limit (cost=8.60..26.28 rows=5 width=62) (actual time=0.124..0.229 rows=5 loops=1)
  InitPlan 1 (returns $0)
    -> Index Scan using business_pkey on business business_1 (cost=0.29..8.32 rows=1 width=32) (actual time=0.008..0.008 rows=1 loops=1)
          Index Cond: (id = 42)
  -> Index Scan using business_geohash_idx on business (cost=0.28..1769.03 rows=500 width=62) (actual time=0.123..0.227 rows=5 loops=1)
        Index Cond: (geohash ~~ $0)
Planning Time: 0.117 ms
Execution Time: 0.246 ms
id business_name x y geohash
6455 business_6455 161.767801616675 -45.6640432705416 pxgh1p5ewvre95j5rkyx
987 business_987 157.535974430581 -46.4991816873455 px8p2z44wk22zke1jqpm
21938 business_21938 157.564399796883 -45.6057628837771 pxbh93wzhtegxh8m2kv9
97368 business_97368 162.477336834207 -50.1455647066609 px5d9r22um5qfvt4msbs
42804 business_42804 167.725818809955 -47.204996981719 pxx72ww4yyg0svjhde6e