Notes on Performance Optimization
Despite the considerable size of this data set, we found that it is possible to serve all required queries in an acceptable timeframe (under ~500 ms) with proper indexing and query construction.
The full results are written up in the issue, but the methods and conclusions are summarized below.
This task was originally framed as evaluating Elasticsearch for this purpose. The first step was to find a use case that "broke" PostgreSQL; the next step would have been to see whether Elasticsearch could succeed for that use case. But I could never find a use case that broke PostgreSQL once proper indexes were in place. The process for tracking down and improving performance is to run EXPLAIN or EXPLAIN ANALYZE and see whether a table scan is being performed, and if so, which clause is causing it (a sketch of this workflow follows below).

PostgreSQL is very mature and supports index patterns for most complex queries; if your query only returns, say, 100 rows of data, it is safe to assume there is a way to define an index that will return that data very quickly. Keep googling! For example, we found that queries slowed down when an ORDER BY clause was added on a column different from the one being queried, but some googling pointed to common table expressions and multicolumn indexes as potential solutions (CTEs didn't work, but multicolumn indexes did).
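As a minimal sketch of that diagnostic workflow (the table name `itemized_receipts` is an assumption for illustration; the column names are taken from the examples below):

```sql
-- EXPLAIN shows the planner's intended plan without running the query;
-- EXPLAIN ANALYZE actually executes it and reports real timings.
EXPLAIN ANALYZE
SELECT *
FROM itemized_receipts          -- hypothetical table name
WHERE occupation = 'SOFTWARE ENGINEER'
ORDER BY receipt_dt
LIMIT 100;

-- In the resulting plan, a "Seq Scan" node means the whole table is
-- being read (the table scan described above), and a "Sort" node
-- sitting above an index scan means the ORDER BY is not being
-- satisfied by the index itself.
```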
Although a table scan is the biggest cause of performance degradation, it's also possible to get poor performance from an index scan when, due to the structure of the data, a large portion of the index has to be scanned. For example, when searching for OCCUPATION='SOFTWARE ENGINEER' ORDER BY receipt_dt LIMIT 100, PostgreSQL will pull all ~500k matching rows for SOFTWARE ENGINEER from the occupation index and then sort them. A table scan takes 20-40 minutes, but even this index scan takes ~2 minutes because of the large number of matching rows (better, but still unacceptable). Also, performance work takes time: it took over a week to figure this out, because iteration is so slow that there can be many dead ends.
In this case the only true failure mode was the one listed above: pairing an ORDER BY clause with other queried fields. The trick ended up being that multicolumn indexes can work like this: the first column is what you query, and the second column is what you order by "for free", i.e., without a sort operation (see the sketch below). No special configuration is required beyond creating a multicolumn index for each pair of (query field, sort field). In our case we also want results always sorted the same way per record for pagination, so we use sub_id as the last layer of sorting.
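Here is a sketch of that pattern, again assuming the hypothetical `itemized_receipts` table. With this index in place, PostgreSQL can walk the SOFTWARE ENGINEER entries already in receipt_dt order and stop after 100 rows, with no Sort node in the plan:

```sql
-- Multicolumn index: first column is the filter, second is the sort
-- order, third (sub_id) breaks ties so pagination is stable.
CREATE INDEX idx_receipts_occupation_dt
    ON itemized_receipts (occupation, receipt_dt, sub_id);

-- This query can now stream its first 100 rows straight off the index,
-- with no separate sort step:
SELECT *
FROM itemized_receipts
WHERE occupation = 'SOFTWARE ENGINEER'
ORDER BY receipt_dt, sub_id
LIMIT 100;
```

Note that the ORDER BY lists receipt_dt and then sub_id, matching the index column order after the filter column; that match is what lets the planner skip the sort.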
So, if we wanted to query by recipient_name, transaction_amount, and occupation, and sort by transaction_amount and receipt_date, we would create multicolumn indexes on the following columns (note: column order is important):
- (recipient_name, receipt_date, sub_id)
- (transaction_amount, receipt_date, sub_id)
- (occupation, receipt_date, sub_id)
- (recipient_name, transaction_amount, sub_id)
- (transaction_amount, sub_id) (when the query field and the sort field are the same column, it only needs to appear once)
- (occupation, transaction_amount, sub_id)
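Written out as DDL, this would look something like the following sketch (the table name `itemized_receipts` and the index names are assumptions; the column lists are the ones above):

```sql
-- One index per (query field, sort field) pair; sub_id is always the
-- final column so ties resolve the same way on every page of results.
CREATE INDEX idx_recipient_name_receipt_date     ON itemized_receipts (recipient_name, receipt_date, sub_id);
CREATE INDEX idx_transaction_amount_receipt_date ON itemized_receipts (transaction_amount, receipt_date, sub_id);
CREATE INDEX idx_occupation_receipt_date         ON itemized_receipts (occupation, receipt_date, sub_id);
CREATE INDEX idx_recipient_name_transaction_amt  ON itemized_receipts (recipient_name, transaction_amount, sub_id);
CREATE INDEX idx_transaction_amount              ON itemized_receipts (transaction_amount, sub_id);
CREATE INDEX idx_occupation_transaction_amount   ON itemized_receipts (occupation, transaction_amount, sub_id);
```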