At Coord, we use PostgreSQL as our primary data store, and we take advantage of the wonderful PostGIS extension to store geospatial data like curb geometries. PostGIS gives you a choice of two different ways to store geospatial data:
Geography, where it assumes that your data is made up of points on the earth’s surface, as specified by latitudes and longitudes.
Geography types will give you more accurate results, especially over long distances, but geometry types have better performance. But how big are the differences in real life? And which one should YOU use? I recently ran some experiments here at Coord to answer this question for us. Read on to learn more!
Doing Radius Queries
Even more than accuracy, simplicity is the reason I most want to use geography types in our database. For example, let’s look at search within a given radius, an operation that Coord uses numerous times in our mobility APIs.
Because this radius is specified in kilometers and not in arbitrary projection distances, to perform this query with a geometry type we have to first know how to convert a distance in meters into a projection distance. The query ends up being something like this:
WITH ( c AS ( SELECT ST_Transform(ST_Point(lat, lng)::geography AS center ), d AS ( SELECT ST_Distance( ST_Transform(center::geometry, my_projection), ST_Transform(ST_Translate(center, 1)::geometry, my_projection) ) AS meter_distance FROM c ) SELECT id FROM my_table, c, d WHERE ST_DWithin( my_table.geometry, center, meter_distance * radius_in_meters);
Not only does this query get less accurate the larger radius_in_metersgrows, it’s also easy to get wrong and hard to read.
On the other hand, if my table uses geography, the query is as simple as:
SELECT id FROM my_table WHERE ST_DWithin( My_table.geography, ST_Point(lat, lng)::geography, radius_in_meters);
If possible, I’d really prefer to use the second query. But how much performance would we be giving up?
Running Performance Benchmarks
To decide what the tradeoffs are between geometry and geography, I wanted to run a benchmark, and figure out how much time and database usage each option would cost us on real-world loads. The simplest way to do this is just to use psql with \timing on. But this will give you unreliable results for a few reasons. The main thing to worry about is cold starts. Because of caching at both the DB and OS level, you will get very different results the first time you query a table than subsequent times. We also see some cold start issues when using a brand-new connection. This means that it’s important to run benchmark queries lots of times in order to get valid results. We use a plpgsql script to keep track of the timings in aggregate and write our results out in CSV form.
In order to compare performance on a real-world load, I took our map of San Francisco’s curb geometries and constructed two tables that were identical with one exception: one of them used geometry (using the EPSG 3857 Web Mercator projection) and one of them used geography. We then ran 5,000 queries like the above on each of these tables, using different radii (we also ran a third set of queries that used geography but with use_spheroid set to false, which makes distance calculations faster. I created a geospatial index on both tables, since I wanted to test the indexing behavior of both types as part of the performance assessment.
Here’s what I found:
Unsurprisingly, geometry (the blue dots) is faster than geography (the red and orange dots). We also learned that, once you have enough rows, the performance of both geometry and geography queries scales linearly with the number of rows returned. This suggests that the index-retrieval step of our query is efficient for both types: if we had to do a table scan, or if the index weren’t working well, we’d see a large fixed time penalty regardless of the number of rows we ended up returning. Instead, the thing that ends up taking the time seems to be the distance calculations on the rows we end up returning.
We can get more evidence if we look at the ratio of query time between geography and geometry.
This shows that, for queries returning very few rows, geography takes less than 2x the time of geometry, but as the number of returned rows increases beyond about 750, the performance penalty levels off at 4x. This suggests that, as you’d expect, fixed retrieval costs are more important for small queries than for large ones.
What this means for you
What should you make of this? One important point is that the fewer rows you scan per query, the smaller a penalty you pay for geography. If your query load is dominated by index lookups, the geography penalty may be very close to zero. But if your queries involve computations on complex geographies, or on large numbers of geographic objects, your penalty will be higher.
Plus, probably not all of your workload is geospatial! A 4x penalty may sound like a lot, but if geospatial computation is only 10% of the work you do for each row to begin with, the penalty you pay as a fraction of your total computation time is much lower.
So, if I were starting a new project from scratch, I would use geography regardless of the performance penalty. Why? Remember that you should never optimize prematurely: unless your queries are slow already, or your database load is becoming problematic, it probably doesn’t make sense to try and speed up. And if performance isn’t an issue already, geography is much easier to manage and much harder to screw up.
Once you have your data working with geography, you at least have a baseline: if you do decide to convert your data to geometry, you can at least compare results to ensure the inaccuracy you incur is acceptable. It also allows you to compare performance to understand how much time you are actually saving.
On the other hand, we’re going to stick with geometry. Curbs API calls can easily return thousands of curbs, and the geospatial costs of those queries are truly significant. We’ve already implemented other performance optimizations to avoid hitting the database for a lot of non-geographical curb data, and the advantages we get from using the faster geometry data type are significant.
Do you use PostGIS? Have you had similar performance questions? I’d love to hear from you at firstname.lastname@example.org. Happy mapping!