Analyzing database queries

You may sometimes find yourself needing to either debug or optimize specific queries against the PostgreSQL database made by the API handlers. This section outlines some tips for going about analyzing and fixing these queries.

SQLAlchemy

As described in the Database Schema section, SkyPortal communicates with the PostgreSQL database via the SQLAlchemy ORM. More specifically, most of this is done through the SQLAlchemy Query object API, found here. Database queries are built via the methods in this Query API, and compiled into SQL queries to be sent to the underlying database - in our case PostgreSQL.

For example, a query written like this on the Python side

DBSession().query(Instrument.id).filter(Instrument.id == 1)
# new version for SQLAlchemy 2.0
with DBSession() as session:
    session.execute(sa.select(Instrument.id).where(Instrument.id == 1))

would yield the following Postgres-dialect SQL query

SELECT instruments.id FROM instruments WHERE instruments.id = 1;

A good first step to debugging or looking for inefficiencies in a query made by the SkyPortal backend is to look at the actual SQL generated by SQLAlchemy for the query. To do so, you can do something like this:

from sqlalchemy.dialects import postgresql
query = DBSession().query(Instrument.id).filter(Instrument.id == 1)
print(query.statement.compile(dialect=postgresql.dialect()))

Note the specification of the PostgreSQL dialect when printing the compiled SQL. This ensures that any Postgres-specific conventions or syntax are respected in the conversion to SQL, as described in the SQLAlchemy documentation here.

For example, the following expression prints out two different versions when the equivalent SQL is asked for, based on whether the PostgreSQL dialect is specified or not:

from sqlalchemy.dialects import postgresql
from sqlalchemy import column
expression = column('x') == 'some value'
print(expression)
print(expression.compile(dialect=postgresql.dialect()))

Output:

x = :x_1
x = %(x_1)s

Finally, you should note that in the SkyPortal API handlers, most of the SQLAlchemy Query objects that you would print out in this manner originate from our permissions framework through the query_accessible_rows() method. Thus, if you’re ever lost looking for those DBSession().query()... lines to do this sort of print-debugging, you may want to look for the query_accessible_rows() lines instead. Similarly, if you’re ever wondering where certain parts of the printed out SQL are coming from (especially a series of joins), the access control rules defined in skyportal/models.py for the database model being queried can be a good place to start.

PostgreSQL

Once you have the full SQL query, it can be easier to spot errors/inefficiencies and to experiment with different approaches. It is easier, after all, to directly edit the SQL and try different ways of joining, moving around subqueries, etc. without also trying to figure out how to get SQLAlchemy to produce the exact queries you’re envisioning.

Some quick tips with regards to debugging/optimizing raw SQL:

  • Use a pretty-printer like this one to get nicely indented versions of longer queries to more easily spot mistakes and understand what a query is doing.

  • A missing index is often both the biggest bottleneck and the quickest fix to a slow query. You can easily check which columns are indexed on a given table by running \d table_name within psql.

  • Another common inefficiency in queries is a cross-join, where each row in a table is joined against every row in another table and is rarely what you really want. Cross-joins will show up in the SQL generated by SQLAlchemy with the table_a , table_b syntax. It can be hard to spot those commas, so be on the lookout for them.

EXPLAIN ANALYZE

Finally, the most helpful (and in-depth) way of analyzing the performance of a SQL query is through the EXPLAIN ANALYZE statement. You can connect to the SkyPortal database using psql and then run EXPLAIN ANALYZE [your SQL query here] in order to get a report of the steps PostgreSQL took to carry out your query, and how long each step took. This can often point you toward the changes necessary to speed up your query. For example, you may notice that PostgreSQL is looping through an entire table (a SEQ SCAN) to look for certain rows instead of using an index. More detailed guidance on using the EXPLAIN outputs can be found in the PostgreSQL documentation.