ConeSearch usage filtering

For the past several weeks we have been seeing extremely heavy usage (>~ 1 request per second) with several distinct users sending tens of thousands of cone search requests in rapid succession. Unfortunately, they generally seem to be sending a series of simultaneous requests until things back up, rather than doing one or two streams of requests waiting for one request to finish before starting another. Generally we’ve been able to respond to these requests, but the response time has been awful and the system load has made the interactive interface largely unusable.

To address this we’ve added in a filtering of input cone search requests. If we have more than two requests going on a given server for the same table, then the new request will immediately fail. Since we have two servers there may still be as many as four queries executing on the same table.

In the few hours since we’ve implemented this we’ve noted that the interactive interfaces are doing much better, though not as crisp as we might like. In the near future we will be moving the database to a much faster machine which should address these problems more definitively.

Heavy Usage

This morning we had a user making very heavy usage of Xamin sending multiple cone search requests per second. Requests piled up in the database and this slowed down the system considerably. Requests that normally took a second could take several minutes. Metadata queries in the interactive versions of Xamin would often time out so that the Web interface was not really working.
We have temporarily disabled access for this IP address.

We’ll be looking at how we can meter usage more effectively in the future and apologize to those adversely affected.

Xamin 1.7 released

A new information pane is now included in the main Xamin web interface. This pane shows information about groups of tables and the detailed table documentation. Previously this information had been displayed in the bottom of the main query pane. The new arrangement makes it much easier to browse table documentation.

Positional matching in Postgres

Positional Queries

One of the fundamental operations for astronomical databases is finding objects by position.  Two common scenarios are where we may want all of the objects near to a specified location or a positional cross-correlation between two tables.  There are many ways to do these and there are some special products available in Postgres for spatial indices. 

In this article we look at 5 different ways for doing positional searches in Postgres:

  1. Haversine: Using the Haversine formula with tables clustered on an index in declination.  A simple constraint on declination is included to make the index useful.
  2. Unit Vector: Using the dot product of unit vectors with tables clustered on an index on Dec.  A simple constraint on declination is included to make the index useful.
  3. PostGIS: Using ST_DWithin with tables where a PostGIS geography object has been added representing the RA/Dec of each row.  Tables are clustered on an index using the geography object.
  4. PGSphere: Using the includes operator (~) in PGSphere with tables where an spoint object has been added representing  the position of each row.  Tables are clustered on an index on the spoint object.
  5. Q3C: Using the q3c_join  function in Q3C with tables clustered on an index on the q3c_ang2ipix(ra,dec).  In this case no column was added to the table.

Tables Used

We created three tables optimized for each search scheme.  These include stripped down versions of the Swift observations table and XMM optical monitor object catalogs with 76,301 and 2,880,728 rows respectively.  Most of the columns other than positions have been removed, but a couple of id columns are retained.  A copy of one of the HEASARC’s master position table was also made. This contains the table name, position and default search radius for each row in every table in the HEASARC database that contains a position, excluding tables with a default search radius larger than 1 degree.  This table contains about 36,000,000 rows. Since some schemes could handle null positions and others could not, there was a slight variation here.

After each table was created, a single index was created, the table was clustered on the index and a vacuum analyze was performed.

The Queries

We ran queries on multiple positions on our master positions tables looking for how many matches we’d have in the HEASARC tables for each scheme.  The positions table has fields table_name, default_search_radius, ra and dec. The queries look something like:

   select table_name,count(*) from positions where
   ... position is within 1 degree of the specified position ...
   and
   ... position is within the default search radius of the specified position...
   group by table_name order by table_name
 

Since the default search radius is less than 1 degree, the first constraint does an initial limit to the table. Where I have looked, queries which have only the second constraint which uses the per row search radius cannot use the indexes efficiently.

For the Haversine and Unit vector approaches we also included a constraint like

 dec between dec0-1 and dec0+1 

where dec0 was the requested declination. This is the constraint that takes advantage of the index.

See the details below for the syntax used in each approach to do the positional constraint.

Note: Initially we used a positional grid with integer-valued grid points. This led to differences between the methods. Some of the data were at half-integer grid points and some table positions were exactly the default search radius away from the grid. Depending upon roundoff, which varied from method to method, the position might or might not be included in the query results. We added pi/20 to each grid coordinate to minimize this problem. Results of all five methods using this offset grid were identical.

A grid of 162 positions were queried. The times given below in the Tables Time are the times to do all 162 queries.

A correlation test did a cross-match of the Swift master table and the XMM Optical Monitor object table looking for pairs within 0.01 degrees. The times and resulting counts are given in the table.

All tests were run on Postgres 9.2.1 on in Linux using a Quad core machine with 2.66 GHz cores. However queries were run purely sequentially.

Results

Each test was run multiple times. In many cases the subsequent queries showed evidence of improvement due to caching of some elements of the query. In practice users often run the same query multiple times (e.g., asking for different formats) or very similar queries, so that the better cached performance is a real advantage.

We restarted the database before running the tests and we show times for the first query and then the average of three subsequent reruns. However we cycled through each of the 5 sets of tables first before rerunning any query.

Approach Ext. Lib. Index Tables Time X-Corr time X-Corr Count
Initial Subseq. Initial Subseq.
Haversine No dec 134 129 77.5 77.5 228,111
Unit Vector No dec 60.3 60.1 37.9 37.4 228,111
PostGis Yes Point(ra,dec) 33.8 3.8 39.6 5.7 228,111
PGSphere Yes spoint(ra,dec) 1.3 1.3 34.5 9.7 228,114
Q3C Yes q3c_ang2ipix(ra,dec) 3.7 3.6 7.9 2.1 228,111

The results are summarized in the figure. Note the logarithmic y-axis.

Just to give some perspective on these results I tried to run the same correlation in TOPCAT. An attempt to duplicate the crossmatch failed due to memory constraints.
A cross-match using similar criteria but between Swift and a 900,000 row table took about 10 seconds (not including time to load the tables) on the same hardware. It suggests TOPCAT’s performance might be comparable to the Unit vector approach. However we have not pursued this seriously.

Notes on the approaches

Haversine

We use the standard Haversine formula and add a simple constraint on declination which enables the index on dec to speed up the query.

SQL (for cross-correlation query):

select count(*) from swiftdec a, xmmdec b
    where
a.dec between b.dec - .01 and b.dec + .01 and
( sin(radians(b.dec-a.dec)/2)^2 + cos(radians(a.dec))*cos(radians(b.dec))*sin(radians(b.ra-a.ra)/2)^2
      <
  sin(radians(.01)/2)^2
)

Unit Vector

In this approach the vector of the position on the unit sphere is included which enables an exact constraint on distance to be done without any trigonometric functions used.  The value on the right of the > is simply cos(.01deg).  This requires no special library but adds three columns to the table.

SQL:

select count(*) from swiftunit a, xmmunit b
where 
   a.dec between b.dec -.01 and b.dec+.01 and
   a.__x*b.__x + a.__y*b.__y + a.__z*b.__z > 0.999999984769129

PostGIS

Since the PostGIS function return results by default in meters on the spheroid we set the last parameter of the st_dwithin to false to do calculations on a sphere. Apparently PostGIS uses an Earth radius of 6371.0087714 km.  We use that to convert .01 degrees to a distance in meters.  A user defined spherical projection to the unit sphere might be easy to define and would get rid of this mysterious constant from our calculations.

PostGIS version 2.0.1 was used. PostGIS is the heaviest weight of the three libraries to install. It has a lot of features and extensive documentation-though It is unclear how many of the features are useful to us. It requires several other libraries to be downloaded and installed locally first.

SQL:

select count(*) from swiftgeog a, xmmgeog b 
where
    st_dwithin(a.geog,b.geog,1111.950792, false)

PGSphere

In PGSphere the ~ operator indicates the the right hand operator includes the left hand operator. There are many possible ways to construct the constraint using PGSphere operators, but only some of them seem to use the index we had made on the pnt field.

The PGSphere cross-correlation gave a slightly different result than the other approaches.

It is unclear if PGSphere is being actively maintained. The last version (V1.1.1) is from early 2010. Some tweaking of the code was required to get it to run in Postgres 9.2. In particular the languages specified for various functions needed to be converted to lower case. PGSphere has the clearest documentation (though we found an error in the example of how to define indexes).

SQL:

select count(*) from swiftpgs a, xmmpgs b 
where 
    scircle(a.pnt, radians(.01))~b.pnt

Q3C

The tables include an index on q3c_ang2ipix(ra,dec). However examples given in the Q3C README file suggested creating the functional index without saving this as a value in the table, so we did not create a column with this value.

The latest version (1.4.13) of Q3C is from early 2012. No issues were found in compiling Q3C into Postgres, but it has limited documentation. The README file seems to be it.

SQL:

select count(*) from swiftq3c a, xmmq3c b 
where q3c_join(a.ra,a.dec,b.ra,b.dec,.01)

Sensitivity to Query Details

Some amount of tweaking of the queries was performed to ensure reasonable performance. It was easy to write queries in ways that did not take any advantage of the indices and required a full table scan. E.g., when doing our positional searches using Q3C, we got dramatic differences based upon whether specific a constraint like
where q3c_join(187.25, 2.05, ra,dec, 1)
versus
where q3c_join(ra,dec, 187.25, 2.05, 1)
I had to try a few different syntaxes in PGSphere before I found one that engaged the indexes and sped up the query.

This suggests that supporting free-form user inputs may be incompatible with efficient querying of the database.

For me, Postgres’ explain functionality was essential in finding efficient queries.

Conclusions

Adding unit vector columns to the table cuts the query time in two. At least in this scenario the simple trigonometric approach is CPU limited so that adding addition fields to the table which reduces the complexity of the offset calculation is worth it.

Much greater reductions are possible using specialized indices. Of the three possibilities studied, Q3C has generally good performance, running about twice as fast the the PostGIS geography type.

The PGSphere library is very fast in doing point queries and but a bit slower than both Q3C and PostGIS for cross-correlations. The slight disagreement in cross-corrleation counts is a bit disconcerting but it’s possible that it is due to the kind of rounding issues we discovered in the positional queries.

For correlations all of the methods using spherical indexing seem to have some startup/caching cost.  The second and subsequent iterations run about three times as fast as the first.  The implication on an operational system are unclear and presumably depend critically upon the query mix.

For the positional queries PostGIS still shows strong evidence of caching, but PGSphere and Q3C do not. Note that the results for the positional queries are for an aggregate 162 queries. The averaged times for individual queries ranged from about 10-300 milliseconds.

Although one should be cautious extrapolating from any small number of tests, it does appear that spatial indices substantially improve performance.  We see an improvement of somewhere between a factor of 2-15 in the time taken for queries.

Either Q3C or PostGIS seem like reasonable choices. Q3C gives the better performance and has a more transparent syntax for astronomers. However PostGIS has a much broader community and far greater certainty of continued support. PGSphere’s performance in positional queries is remarkably good but the lack of clear support and variance in results in the cross-correlation are worrying.

Batch Queries Moving to Xamin

On October 2, 2012 we will begin sending batch queries to our  Xamin interface.  Internal tests have shown Xamin about twice as fast as the current interface.

The switch will be accomplished by internally redirecting batch queries sent to the current URL to an Xamin URL.  Non-batch queries (used by our Browse interface) are not affected.  The redirection will be done internally within our site so that web clients will not see an HTTP 30x redirection response.

There are some small differences in the formats of data returned discussed in a previous post, but we hope that this transition should be largely invisible to users.  If you want to switch over early, you can query the URL

http://heasarc.gsfc.nasa.gov/xamin/BrowseBatch

rather than the

http://heasarc.gsfc.nasa.gov/cgi-bin/W3Browse/w3query.pl

URL currently invoked.  After next Tuesday batch requests to the w3query.pl URL will be transparently redirected to Xamin.  If you want to invoke the old behavior after next week you can switch to using

http://heasarc.gsfc.nasa.gov/cgi-bin/W3Browse/w3query_noredir.pl

for your batch requests.

Versions of our batch scripts that will point consistently to either Xamin or Browse are available at http://heasarc.gsfc.nasa.gov/FTP/software/web_batch.  We recommend that you use the wget versions of scripts if possible since the simple webquery.pl tool used as an alternative has no ability to negotiate firewalls or redirections.

 

Xamin and batch queries

Over the next couple of weeks we plan for Xamin to take over responsibility for batch queries sent to the HEASARC. In our initial tests the Xamin queries ran about 4 times faster.  While we’ve made some efforts to minimize the differences between Xamin and the current Browse-based results there are still a few.

The differences include:

  • The name of the column giving the offset between the target position and the row in the table is _offset, not Search_offset
  • The value of the offset column does not include the central position.  E.g., if you have asked for rows near the position 00 40 24, 40 32 , the current offset column might contain 0.235 (00 40 24, 40 32) but it will just be 0.235 in the new version.
  • Class codes will be returned for the HEASARC classifications, not the string value corresponding to the class code.
  • The mission name is not given when the list of available tables is shown.  In Xamin not all tables are associated with missions (some are associated with regimes or object types).  We added a ‘archive’ column to keep the format similar.
  • Error messages are very different.  E.g., If a user requests a VOTable output, Browse still gives an HTML error message when there is an error retrieving data, but Xamin returns an VOTable with the error status.  If there are errors when the user is making a FITS or Excel format request, then there may be no output where Browse might have returned an HTML error message.
  • There are numerous detailed format differences in the Excel and FITS outputs.  E.g., Xamin uses a binary table format while Browse uses ASCII tables.

There are some new features.  Since we are limited to the interface that the Xamin batch script provides only a few of the new capabilities of Xamin are available, but a few do leak though e.g.,

  • The fields displayed can contain expressions of fields, not just one or more column names: fields=’SomeText’, ra+dec,cos(ra/57)
  • Results can be sorted by multiple columns (sortvar=col1,col2) and in both ascending and descending order (sortvar=col1:+,col2:-).

Please let us know if you have any problems with the new format.  If you would like to test this out before we release it change the URL the batch script points to from http://heasarc.gsfc.nasa.gov/cgi-bin/W3Browse/w3query.pl to http://heasarc.gsfc.nasa.gov/xamin/BrowseBatch.

 

Xamin passes query milestone

The vast majority of the queries that the HEASARC receives are through automated interfaces that use either the VO cone search protocol or the Batch protocol we developed for Browse (long before the VO). Starting this week cone searches are running the Xamin interfaces. Usually there are more batch interface queries than cone searches, but there is a lot of variation. Yesterday we had a few more cone search requests and as a result Xamin satisfied more query requests than Browse for the first time, just under 10,000 Xamin queries versus just over 7,000 in Browse.

We’re looking at moving the Batch requests to Xamin in the next couple of weeks. Preliminary tests show it is up to 4x faster. When we do that Xamin will be handling over 90% of the HEASARC’s query load.

Xamin 1.6 released

A new version of Xamin, 1.6, was released this morning. The most noticable changes are a new notice facility and links to the Xamin Blog in the help menu. We will use the notice facility to inform users of new features and scheduled downtimes. Error messages are now displayed when SAMP connections (e.g., to TOPCAT or Aladin) fail, and the HTTP headers in data responses now give a suggested time to the next query which is set based upon the current system load. Some modifications to the cone search responses ensure better compliance to the VO Cone Search standard and more compatibility with the format used in Browse-base cone searches.