-
Notifications
You must be signed in to change notification settings - Fork 106
Enabling an expression based sort
Some of our API endpoints make use of SQL expressions (functions) in the WHERE
clause of the underlying SQL statements. In these cases, we should be making use of the same expression in the ORDER BY
clause of the SQL statement if we're also sorting (or would like to enable sorting) on the same field(s) found in the WHERE
clause.
In order to accomplish this and ensure that the queries remain performant (or become more performant in the case of addressing existing performance issues), we must make use of indexes on expressions. With an index setup on the expression itself we can take advantage of the performance gains in both the WHERE
and ORDER BY
clauses in the SQL statements generated for the API.
There are three steps for accomplishing this:
- Setting up the expression-based index(es)
- Setting up the expression-based sort clause to map to the column(s) exposed for sorting in the API
- Excluding the expression-based sort mapping from the schema tied to the resource
Let's dive into the details of these three steps using issue #2791 as the basis for the example.
Setting up an expression-based index is identical to a regular index, the only difference is there is a SQL expression in place of a column name. For example, there are COALESCE
statements in the WHERE
clause for the Schedule B queries:
COALESCE(ofec_sched_b_master.disb_dt, '9999-12-31'::date);
We'll be referencing this statement in the example. But first, note that the API is also keeping track of an index for paging, and in the case of Schedule A and B, a CHECK CONSTRAINT
for table partitioning due to the size of those data sets. All of our indexes must keep these things in mind, so they are usually setup as multicolumn indexes for maximum performance with our filtering, paging, and sorting options.
Back to our example with Schedule B, in order to create an index for this statement we would execute this statement:
CREATE INDEX idx_ofec_sched_b_1977_1978_election_cycle_coalesce_disb_dt
ON public.ofec_sched_b_1977_1978
USING btree
(two_year_transaction_period, COALESCE(disb_dt, '9999-12-31'::date), sub_id);
Please note that we would do this for all tables in the Schedule B partition. With an expression-based index created, the WHERE
and ORDER BY
clauses can now take advantage of it. In this case, it's the COALESCE
statement using the disb_dt
column.
A full example of this can be seen in PR #2904, where a migration was created to add a group of expression-based indexes for the Schedule B partition.
With expression-based indexes in place, any query with an expression in the WHERE
clause that matches the index will automatically take advantage of it. The ORDER BY
clause will not, however, because of the way the API is setup and how our sorting is exposed in the interface and URL. Only a column name is exposed in the API (both for usability and security, as we don't want to open ourselves up to SQL injection attacks), so we must map the column name to the expression we would actually like to sort by. This is transparent to the user, meaning the API continues to work as it always has by specifying one or more (where supported) sorting columns with an optional -
in front to toggle between ascending or descending order.
In order to setup this mapping to enable expression-based sorting for a given field, a hybrid_property
must be setup on the model associated with the resource that is mapped to the endpoint being modified. For example, when using the API to view the Schedule B (disbursement) data, you would see this in the URL parameters by default: sort=-disbursement_date
. This is the column name exposed in the API, but under the hood it should actually sort by COALESCE(ofec_sched_b_master.disb_dt, '9999-12-31'::date)
. The API model is what enables us to do this if we add a new property to it.
Continuing with the Schedule B example, take a look at the ScheduleB
model. First, we need to make sure we're importing these three things at the top of the file (there will likely be additional import statements):
import marshmallow as ma
import sqlalchemy as sa
from sqlalchemy.ext.hybrid import hybrid_property
With those in place, find the model we want to add the sort expression support to and add this code (again, this is the code built for Schedule B in this example):
@hybrid_property
def sort_expressions(self):
return {
'disbursement_date': {
'expression': sa.func.coalesce(
self.disbursement_date,
sa.cast('9999-12-31', sa.Date)
),
'field': ma.fields.Date,
'type': 'date',
'null_sort': self.disbursement_date, # This is optional - it will default to the expression itself.
},
}
Let's unpack what is going on here:
- First, we need to use the
@hybrid_property
decorator because we are making use of a column defined in the model and a SQLAlchemy construct (the reference tocoalesce
in this case). - Second, we create the property as a method and call it
sort_expressions
. This is important as the sorting code checks for the presence of this property name on a model to see if any columns need to be swapped for expressions. - Third, we return a
dict
in the method that contains the column names found in the model as the keys and an inner dictionary that contains the expression, marshmallow field, and data type we would like to swap them with as the values. Again, the outer keys are the column names that are defined on the model, meaning the model properties, not the database column name itself. In this example,disbursement_date
is the model property anddisb_dt
is the database column name. We need to usedisbursement_date
because that is the model property and what is exposed in the API.
The inner dictionary always has three parts to it, and one optional part:
- The expression itself
- The marshmallow field that the value of the expression would be mapped to
- The data type of the value in the database itself as a lower case string
-
(OPTIONAL) The means for accounting how to sort
NULL
values
The expression itself is calculated for its actual value in the pagination code, so we need the marshmallow field because it is used to serialize the value of the expression. Furthermore, we need the data type of the expression because that is used to help construct the ORDER BY
clause in the pagination code as well. Both of these items are derived programmatically from a column, but they cannot be derived from an expression, so we must explicitly map them as a part of this.
The optional field, null_sort
, is for accounting how to sort in the case of NULL
values. In this example with Schedule B, the disbursement_date
field can be NULL
and since we expose it in the API as NULL
as well as in the web site, we need to still be able to sort NULL
values even though the expression would never yield a NULL
value. In this case, we have to provide an alternative value to sort by so the sorting still works as expected. However, if you are working with an expression that does return a NULL
value, then this is not needed and the sorting code will default to also using the expression for sorting NULL
values.
If we were going to enable sorting by expression for another column, we would add another key/value pair in the dictionary of the model property and the expression, field, and type to order by.
The sorting and paging code takes care of all of the rest for us, including supporting which direction to sort in. With this property and mapping in place, the expression will be used in the ORDER BY
clause, and the ORDER BY
clause in turn will make use of the index setup for the expression.
If you are interested in seeing the full implementation of how this is constructed, PR #2905 contains the changeset that enables the underlying support for this.
Each end point is tied to a resource, which in turn is usually a mapping between a model and a schema. If there is a schema associated with a resource, we need to also ignore the sort_expressions
property in the schema as that should not be returned with the results.
Using Schedule B as the example once more, the Schedule B schema is defined as follows in webservices/schemas.py:
ScheduleBSchema = make_schema(
models.ScheduleB,
fields={
'memoed_subtotal': ma.fields.Boolean(),
'committee': ma.fields.Nested(schemas['CommitteeHistorySchema']),
'recipient_committee': ma.fields.Nested(schemas['CommitteeHistorySchema']),
'image_number': ma.fields.Str(),
'original_sub_id': ma.fields.Str(),
'sub_id': ma.fields.Str(),
},
options={
'exclude': (
'recipient_name_text',
'disbursement_description_text',
'recipient_street_1',
'recipient_street_2',
),
'relationships': [
Relationship(
models.ScheduleB.committee,
models.CommitteeHistory.name,
'committee_name',
1
),
],
}
)
The exclude
key in the options
dictionary needs an additional entry in it, sort_expressions
:
ScheduleBSchema = make_schema(
models.ScheduleB,
fields={
'memoed_subtotal': ma.fields.Boolean(),
'committee': ma.fields.Nested(schemas['CommitteeHistorySchema']),
'recipient_committee': ma.fields.Nested(schemas['CommitteeHistorySchema']),
'image_number': ma.fields.Str(),
'original_sub_id': ma.fields.Str(),
'sub_id': ma.fields.Str(),
},
options={
'exclude': (
'recipient_name_text',
'disbursement_description_text',
'recipient_street_1',
'recipient_street_2',
'sort_expressions',
),
'relationships': [
Relationship(
models.ScheduleB.committee,
models.CommitteeHistory.name,
'committee_name',
1
),
],
}
)
This will ensure the sort_expressions
property does not appear in the JSON output of the API.