Skip to content

Enabling an expression based sort

Carlo Costino edited this page May 6, 2018 · 15 revisions

Overview

Some of our API endpoints make use of SQL expressions (functions) in the WHERE clause of the underlying SQL statements. In these cases, we should be making use of the same expression in the ORDER BY clause of the SQL statement if we're also sorting (or would like to enable sorting) on the same field(s) found in the WHERE clause.

In order to accomplish this and ensure that the queries remain performant (or become more performant in the case of addressing existing performance issues), we must make use of indexes on expressions. With an index setup on the expression itself we can take advantage of the performance gains in both the WHERE and ORDER BY clauses in the SQL statements generated for the API.

There are three steps for accomplishing this:

  • Setting up the expression-based index(es)
  • Setting up the expression-based sort clause to map to the column(s) exposed for sorting in the API
  • Excluding the expression-based sort mapping from the schema tied to the resource

Let's dive into the details of these three steps using issue #2791 as the basis for the example.

Expression-based indexes

Setting up an expression-based index is identical to a regular index, the only difference is there is a SQL expression in place of a column name. For example, there are COALESCE statements in the WHERE clause for the Schedule B queries:

COALESCE(ofec_sched_b_master.disb_dt, '9999-12-31'::date);

We'll be referencing this statement in the example. But first, note that the API is also keeping track of an index for paging, and in the case of Schedule A and B, a CHECK CONSTRAINT for table partitioning due to the size of those data sets. All of our indexes must keep these things in mind, so they are usually setup as multicolumn indexes for maximum performance with our filtering, paging, and sorting options.

Back to our example with Schedule B, in order to create an index for this statement we would execute this statement:

CREATE INDEX idx_ofec_sched_b_1977_1978_election_cycle_coalesce_disb_dt
  ON public.ofec_sched_b_1977_1978
  USING btree
  (two_year_transaction_period, COALESCE(disb_dt, '9999-12-31'::date), sub_id);

Please note that we would do this for all tables in the Schedule B partition. With an expression-based index created, the WHERE and ORDER BY clauses can now take advantage of it. In this case, it's the COALESCE statement using the disb_dt column.

A full example of this can be seen in PR #2904, where a migration was created to add a group of expression-based indexes for the Schedule B partition.

Expression-based sort clauses

With expression-based indexes in place, any query with an expression in the WHERE clause that matches the index will automatically take advantage of it. The ORDER BY clause will not, however, because of the way the API is setup and how our sorting is exposed in the interface and URL. Only a column name is exposed in the API (both for usability and security, as we don't want to open ourselves up to SQL injection attacks), so we must map the column name to the expression we would actually like to sort by. This is transparent to the user, meaning the API continues to work as it always has by specifying one or more (where supported) sorting columns with an optional - in front to toggle between ascending or descending order.

In order to setup this mapping to enable expression-based sorting for a given field, a hybrid_property must be setup on the model associated with the resource that is mapped to the endpoint being modified. For example, when using the API to view the Schedule B (disbursement) data, you would see this in the URL parameters by default: sort=-disbursement_date. This is the column name exposed in the API, but under the hood it should actually sort by COALESCE(ofec_sched_b_master.disb_dt, '9999-12-31'::date). The API model is what enables us to do this if we add a new property to it.

Continuing with the Schedule B example, take a look at the ScheduleB model. First, we need to make sure we're importing these three things at the top of the file (there will likely be additional import statements):

import marshmallow as ma
import sqlalchemy as sa

from sqlalchemy.ext.hybrid import hybrid_property

With those in place, find the model we want to add the sort expression support to and add this code (again, this is the code built for Schedule B in this example):

@hybrid_property
def sort_expressions(self):
    return {
        'disbursement_date': {
            'expression': sa.func.coalesce(
                self.disbursement_date,
                sa.cast('9999-12-31', sa.Date)
            ),
            'field': ma.fields.Date,
            'type': 'date',
            'null_sort': self.disbursement_date, # This is optional - it will default to the expression itself.
        },
    }

Let's unpack what is going on here:

  • First, we need to use the @hybrid_property decorator because we are making use of a column defined in the model and a SQLAlchemy construct (the reference to coalesce in this case).
  • Second, we create the property as a method and call it sort_expressions. This is important as the sorting code checks for the presence of this property name on a model to see if any columns need to be swapped for expressions.
  • Third, we return a dict in the method that contains the column names found in the model as the keys and an inner dictionary that contains the expression, marshmallow field, and data type we would like to swap them with as the values. Again, the outer keys are the column names that are defined on the model, meaning the model properties, not the database column name itself. In this example, disbursement_date is the model property and disb_dt is the database column name. We need to use disbursement_date because that is the model property and what is exposed in the API.

The inner dictionary always has three parts to it, and one optional part:

  • The expression itself
  • The marshmallow field that the value of the expression would be mapped to
  • The data type of the value in the database itself as a lower case string
  • (OPTIONAL) The means for accounting how to sort NULL values

The expression itself is calculated for its actual value in the pagination code, so we need the marshmallow field because it is used to serialize the value of the expression. Furthermore, we need the data type of the expression because that is used to help construct the ORDER BY clause in the pagination code as well. Both of these items are derived programmatically from a column, but they cannot be derived from an expression, so we must explicitly map them as a part of this.

The optional field, null_sort, is for accounting how to sort in the case of NULL values. In this example with Schedule B, the disbursement_date field can be NULL and since we expose it in the API as NULL as well as in the web site, we need to still be able to sort NULL values even though the expression would never yield a NULL value. In this case, we have to provide an alternative value to sort by so the sorting still works as expected. However, if you are working with an expression that does return a NULL value, then this is not needed and the sorting code will default to also using the expression for sorting NULL values.

If we were going to enable sorting by expression for another column, we would add another key/value pair in the dictionary of the model property and the expression, field, and type to order by.

The sorting and paging code takes care of all of the rest for us, including supporting which direction to sort in. With this property and mapping in place, the expression will be used in the ORDER BY clause, and the ORDER BY clause in turn will make use of the index setup for the expression.

If you are interested in seeing the full implementation of how this is constructed, PR #2905 contains the changeset that enables the underlying support for this.

Excluding the expressions from the schema

Each end point is tied to a resource, which in turn is usually a mapping between a model and a schema. If there is a schema associated with a resource, we need to also ignore the sort_expressions property in the schema as that should not be returned with the results.

Using Schedule B as the example once more, the Schedule B schema is defined as follows in webservices/schemas.py:

ScheduleBSchema = make_schema(
    models.ScheduleB,
    fields={
        'memoed_subtotal': ma.fields.Boolean(),
        'committee': ma.fields.Nested(schemas['CommitteeHistorySchema']),
        'recipient_committee': ma.fields.Nested(schemas['CommitteeHistorySchema']),
        'image_number': ma.fields.Str(),
        'original_sub_id': ma.fields.Str(),
        'sub_id': ma.fields.Str(),
    },
    options={
        'exclude': (
            'recipient_name_text',
            'disbursement_description_text',
            'recipient_street_1',
            'recipient_street_2',
        ),
        'relationships': [
            Relationship(
                models.ScheduleB.committee,
                models.CommitteeHistory.name,
                'committee_name',
                1
            ),
        ],
    }
)

The exclude key in the options dictionary needs an additional entry in it, sort_expressions:

ScheduleBSchema = make_schema(
    models.ScheduleB,
    fields={
        'memoed_subtotal': ma.fields.Boolean(),
        'committee': ma.fields.Nested(schemas['CommitteeHistorySchema']),
        'recipient_committee': ma.fields.Nested(schemas['CommitteeHistorySchema']),
        'image_number': ma.fields.Str(),
        'original_sub_id': ma.fields.Str(),
        'sub_id': ma.fields.Str(),
    },
    options={
        'exclude': (
            'recipient_name_text',
            'disbursement_description_text',
            'recipient_street_1',
            'recipient_street_2',
            'sort_expressions',
        ),
        'relationships': [
            Relationship(
                models.ScheduleB.committee,
                models.CommitteeHistory.name,
                'committee_name',
                1
            ),
        ],
    }
)

This will ensure the sort_expressions property does not appear in the JSON output of the API.