Positional matching in Postgres

Positional Queries

One of the fundamental operations for astronomical databases is finding objects by position. Two common scenarios are finding all of the objects near a specified location and doing a positional cross-correlation between two tables. There are many ways to do these, and Postgres has several add-on products that provide spatial indices.

In this article we look at five different ways of doing positional searches in Postgres:

  1. Haversine: Using the Haversine formula with tables clustered on an index on declination. A simple constraint on declination is included to make the index useful.
  2. Unit Vector: Using the dot product of unit vectors with tables clustered on an index on declination. A simple constraint on declination is included to make the index useful.
  3. PostGIS: Using ST_DWithin with tables to which a PostGIS geography object representing the RA/Dec of each row has been added. Tables are clustered on an index on the geography object.
  4. PGSphere: Using the includes operator (~) in PGSphere with tables to which an spoint object representing the position of each row has been added. Tables are clustered on an index on the spoint object.
  5. Q3C: Using the q3c_join function in Q3C with tables clustered on an index on q3c_ang2ipix(ra,dec). In this case no column was added to the table.

Tables Used

For each search scheme we created optimized versions of three tables. These include stripped-down versions of the Swift observations table and the XMM Optical Monitor object catalog, with 76,301 and 2,880,728 rows respectively. Most of the columns other than positions have been removed, but a couple of id columns are retained. A copy of the HEASARC's master positions table was also made. This contains the table name, position and default search radius for each row of every table in the HEASARC database that contains a position, excluding tables with a default search radius larger than 1 degree. This table contains about 36,000,000 rows. Since some schemes could handle null positions and others could not, its exact contents varied slightly from scheme to scheme.

After each table was created, a single index was created, the table was clustered on the index and a vacuum analyze was performed.
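
For the declination-indexed schemes, for example, the whole setup looked roughly like the sketch below. The source table and column names here (swift_master, obsid) are illustrative, not the ones actually used.

-- Sketch: build, index, cluster and analyze a declination-indexed copy of the Swift table
create table swiftdec as
    select obsid, ra, dec from swift_master where dec is not null;
create index swiftdec_dec_idx on swiftdec (dec);
cluster swiftdec using swiftdec_dec_idx;
vacuum analyze swiftdec;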

The Queries

For each scheme we ran queries at a set of positions against our master positions table, counting how many matches there would be in each HEASARC table. The positions table has fields table_name, default_search_radius, ra and dec. The queries look something like:

   select table_name,count(*) from positions where
   ... position is within 1 degree of the specified position ...
   and
   ... position is within the default search radius of the specified position...
   group by table_name order by table_name
 

Since the default search radius is always less than 1 degree, the first constraint makes an initial cut on the table. In the cases I have looked at, queries with only the second constraint, which uses the per-row search radius, cannot use the indexes efficiently.

For the Haversine and Unit vector approaches we also included a constraint like

 dec between dec0-1 and dec0+1 

where dec0 was the requested declination. This is the constraint that takes advantage of the index.

See the details below for the syntax used in each approach to do the positional constraint.
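
To make the shape of these queries concrete, a single positional query in the Haversine scheme might look roughly like the sketch below, where 187.25 and 2.05 stand in for one grid position; the queries actually run may have differed in detail.

-- Sketch of one positional query against the 36,000,000 row positions table
select table_name, count(*)
  from positions
 where dec between 2.05 - 1 and 2.05 + 1
   and sin(radians(dec-2.05)/2)^2
         + cos(radians(dec))*cos(radians(2.05))*sin(radians(ra-187.25)/2)^2
       < sin(radians(1)/2)^2
   and sin(radians(dec-2.05)/2)^2
         + cos(radians(dec))*cos(radians(2.05))*sin(radians(ra-187.25)/2)^2
       < sin(radians(default_search_radius)/2)^2
 group by table_name
 order by table_name;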

Note: Initially we used a positional grid with integer-valued grid points. This led to differences between the methods. Some of the data were at half-integer grid points and some table positions were exactly the default search radius away from the grid. Depending upon roundoff, which varied from method to method, the position might or might not be included in the query results. We added pi/20 to each grid coordinate to minimize this problem. Results of all five methods using this offset grid were identical.

A grid of 162 positions was queried. The times given in the Tables Time column below are the times to do all 162 queries.

A correlation test did a cross-match of the Swift master table and the XMM Optical Monitor object table looking for pairs within 0.01 degrees. The times and resulting counts are given in the table.

All tests were run on Postgres 9.2.1 on Linux using a quad-core machine with 2.66 GHz cores. However, queries were run purely sequentially.

Results

Each test was run multiple times. In many cases the subsequent queries showed evidence of improvement due to caching of some elements of the query. In practice users often run the same query multiple times (e.g., asking for different formats) or very similar queries, so that the better cached performance is a real advantage.

We restarted the database before running the tests and we show times for the first query and then the average of three subsequent reruns. However we cycled through each of the 5 sets of tables first before rerunning any query.

Approach     Ext. Lib.  Index                  Tables Time         X-Corr Time         X-Corr Count
                                               Initial   Subseq.   Initial   Subseq.
Haversine    No         dec                    134       129       77.5      77.5      228,111
Unit Vector  No         dec                    60.3      60.1      37.9      37.4      228,111
PostGIS      Yes        Point(ra,dec)          33.8      3.8       39.6      5.7       228,111
PGSphere     Yes        spoint(ra,dec)         1.3       1.3       34.5      9.7       228,114
Q3C          Yes        q3c_ang2ipix(ra,dec)   3.7       3.6       7.9       2.1       228,111

The results are summarized in the figure. Note the logarithmic y-axis.

Just to give some perspective on these results, I tried to run the same correlation in TOPCAT. The attempt to duplicate the cross-match failed due to memory constraints. A cross-match using similar criteria but between Swift and a 900,000 row table took about 10 seconds (not including the time to load the tables) on the same hardware, which suggests TOPCAT's performance might be comparable to the Unit Vector approach. However we have not pursued this seriously.

Notes on the approaches

Haversine

We use the standard Haversine formula and add a simple constraint on declination which enables the index on dec to speed up the query.

SQL (for cross-correlation query):

select count(*) from swiftdec a, xmmdec b
where
    a.dec between b.dec - .01 and b.dec + .01 and
    (   sin(radians(b.dec-a.dec)/2)^2
          + cos(radians(a.dec))*cos(radians(b.dec))*sin(radians(b.ra-a.ra)/2)^2
        < sin(radians(.01)/2)^2
    )

Unit Vector

In this approach the components of the unit vector toward each position are stored in the table, which allows an exact constraint on angular distance without evaluating any trigonometric functions in the query. The value on the right of the > is simply cos(0.01 deg). This requires no special library but adds three columns to the table.

SQL:

select count(*) from swiftunit a, xmmunit b
where 
   a.dec between b.dec -.01 and b.dec+.01 and
   a.__x*b.__x + a.__y*b.__y + a.__z*b.__z > 0.999999984769129
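
The three extra columns are just the Cartesian components of the unit vector toward each position. They could be populated with something like the sketch below; the actual column definitions are not reproduced here, only the __x, __y and __z names used in the query above.

-- Sketch: add and fill unit-vector columns on the XMM table
alter table xmmunit
    add column __x double precision,
    add column __y double precision,
    add column __z double precision;
update xmmunit set
    __x = cos(radians(dec))*cos(radians(ra)),
    __y = cos(radians(dec))*sin(radians(ra)),
    __z = sin(radians(dec));

The dot product of two such unit vectors is the cosine of the angular separation between them, so comparing it against cos(0.01 deg) = 0.999999984769129 is an exact test of the 0.01 degree matching radius.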

PostGIS

Since the PostGIS geography functions work by default in meters on the spheroid, we set the last parameter of st_dwithin to false to do the calculations on a sphere. Apparently PostGIS uses an Earth radius of 6371.0087714 km. We use that to convert 0.01 degrees to a distance in meters: 0.01 * pi/180 * 6,371,008.7714 m ≈ 1111.950792 m. A user-defined spherical projection to the unit sphere might be easy to define and would get rid of this mysterious constant from our calculations.

PostGIS version 2.0.1 was used. PostGIS is the heaviest weight of the three libraries to install. It has a lot of features and extensive documentation, though it is unclear how many of the features are useful to us. It requires several other libraries to be downloaded and installed locally first.

SQL:

select count(*) from swiftgeog a, xmmgeog b 
where
    st_dwithin(a.geog,b.geog,1111.950792, false)
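
The geography column and its index might be set up along the lines of the sketch below; the SRID and the use of ST_MakePoint with RA as longitude and Dec as latitude are assumptions, not taken from the scripts actually used.

-- Sketch: add a geography point for each row, index and cluster on it
alter table xmmgeog add column geog geography(Point, 4326);
update xmmgeog set geog = ST_SetSRID(ST_MakePoint(ra, dec), 4326)::geography;
create index xmmgeog_geog_idx on xmmgeog using gist (geog);
cluster xmmgeog using xmmgeog_geog_idx;
vacuum analyze xmmgeog;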

PGSphere

In PGSphere the ~ operator tests whether one object contains another: in the query below it checks that the circle of radius 0.01 degrees around a.pnt contains the point b.pnt. There are many possible ways to construct the constraint using PGSphere operators, but only some of them seem to use the index we had made on the pnt field.

The PGSphere cross-correlation gave a slightly different result than the other approaches.

It is unclear if PGSphere is being actively maintained. The last version (V1.1.1) is from early 2010. Some tweaking of the code was required to get it to run in Postgres 9.2. In particular the languages specified for various functions needed to be converted to lower case. PGSphere has the clearest documentation (though we found an error in the example of how to define indexes).

SQL:

select count(*) from swiftpgs a, xmmpgs b 
where 
    scircle(a.pnt, radians(.01)) ~ b.pnt
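
The spoint constructor takes longitude and latitude in radians, so the pnt column and its index were presumably built along the lines of the sketch below (names illustrative).

-- Sketch: add an spoint column and a GiST index on it
alter table xmmpgs add column pnt spoint;
update xmmpgs set pnt = spoint(radians(ra), radians(dec));
create index xmmpgs_pnt_idx on xmmpgs using gist (pnt);
cluster xmmpgs using xmmpgs_pnt_idx;
vacuum analyze xmmpgs;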

Q3C

The tables include an index on q3c_ang2ipix(ra,dec). However, the examples in the Q3C README file suggest creating this functional index directly rather than saving the value as a column in the table, so we did not add such a column.
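
Following that pattern, the index was presumably created directly on the expression, roughly as in this sketch:

-- Sketch: functional Q3C index, following the pattern suggested in the README
create index swiftq3c_q3c_idx on swiftq3c (q3c_ang2ipix(ra, dec));
cluster swiftq3c using swiftq3c_q3c_idx;
analyze swiftq3c;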

The latest version (1.4.13) of Q3C is from early 2012. No issues were found in compiling Q3C into Postgres, but it has limited documentation. The README file seems to be it.

SQL:

select count(*) from swiftq3c a, xmmq3c b 
where q3c_join(a.ra,a.dec,b.ra,b.dec,.01)

Sensitivity to Query Details

Some amount of tweaking of the queries was performed to ensure reasonable performance. It was easy to write queries in ways that did not take any advantage of the indices and required a full table scan. E.g., when doing our positional searches using Q3C, we got dramatically different performance depending upon whether we wrote the constraint as
where q3c_join(187.25, 2.05, ra, dec, 1)
or as
where q3c_join(ra, dec, 187.25, 2.05, 1)
since only one of the two argument orders made use of the index. Similarly, I had to try a few different syntaxes in PGSphere before I found one that engaged the indexes and sped up the query.

This suggests that supporting free-form user inputs may be incompatible with efficient querying of the database.

For me, Postgres’ explain functionality was essential in finding efficient queries.
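
For example, prefixing a candidate query with explain (or explain analyze, which actually executes it) shows immediately whether the planner uses an index or falls back to a sequential scan over the whole table:

-- Does the Q3C cross-match use the functional index or a sequential scan?
explain analyze
select count(*) from swiftq3c a, xmmq3c b
 where q3c_join(a.ra, a.dec, b.ra, b.dec, .01);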

Conclusions

Adding unit vector columns to the table roughly halves the query time. At least in this scenario the simple trigonometric approach is CPU limited, so adding extra fields to the table to reduce the complexity of the offset calculation is worth it.

Much greater reductions are possible using specialized indices. Of the three possibilities studied, Q3C has generally good performance, running about twice as fast as the PostGIS geography type.

The PGSphere library is very fast at point queries but a bit slower than both Q3C and PostGIS for cross-correlations. The slight disagreement in cross-correlation counts is a bit disconcerting, but it is possible that it is due to the kind of rounding issues we discovered in the positional queries.

For correlations all of the methods using spherical indexing seem to have some startup/caching cost. The second and subsequent iterations run about three times as fast as the first. The implications for an operational system are unclear and presumably depend critically upon the query mix.

For the positional queries PostGIS still shows strong evidence of caching, but PGSphere and Q3C do not. Note that the results for the positional queries are aggregates over 162 queries. The average times for individual queries ranged from about 10 to 300 milliseconds.

Although one should be cautious extrapolating from any small number of tests, it does appear that spatial indices substantially improve performance. We see an improvement of somewhere between a factor of 2 and 15 in the time taken for queries.

Either Q3C or PostGIS seems like a reasonable choice. Q3C gives the better performance and has a more transparent syntax for astronomers. However PostGIS has a much broader community and far greater certainty of continued support. PGSphere's performance on positional queries is remarkably good, but the lack of clear support and the discrepant result in the cross-correlation are worrying.