Degraded full text search performance on larger datasets w/ partial matching #172

jpcamara opened this issue Jan 17, 2024 · 3 comments

@jpcamara

I was experimenting with table size from the "Implementing Full Text Search (FTS)" chapter and I found that at 25 million rows + prefix: true, the queries start to take a big perf hit (3-5 seconds).

So for instance:

  • Modify db/scripts/bulk_load.sh to use FROM GENERATE_SERIES(1, 25000000) seq; instead of FROM GENERATE_SERIES(1, 10000000) seq;
  • Using the generated column:
pg_search_scope :search_by_full_name,
  against: :searchable_full_name, # stored generated column tsvector
  using: {
    tsearch: {
      dictionary: 'english',
      tsvector_column: 'searchable_full_name'
    }
  }

On this ☝🏼, the performance is pretty good. I see anywhere from 80ms to 150ms.

Benchmark.ms { User.search_by_full_name("ja").limit(5).to_a }
  User Load (72.9ms)  SELECT "users"."id", "users"."first_name", "users"."last_name", "users"."email", "users"."type", "users"."created_at", "users"."updated_at", "users"."password_digest", "users"."trips_count", "users"."drivers_license_number", "users"."searchable_full_name" FROM "users" INNER JOIN (SELECT "users"."id" AS pg_search_id, (ts_rank(("users"."searchable_full_name"), (to_tsquery('english', ''' ' || 'ja' || ' ''')), 0)) AS rank FROM "users" WHERE (("users"."searchable_full_name") @@ (to_tsquery('english', ''' ' || 'ja' || ' ''')))) AS pg_search_7dfb4cf67742cb0660305e ON "users"."id" = pg_search_7dfb4cf67742cb0660305e.pg_search_id ORDER BY pg_search_7dfb4cf67742cb0660305e.rank DESC, "users"."id" ASC LIMIT $1  [["LIMIT", 5]]
 => 114.21500006690621

But if I modify it to use prefix: true, so I can match on partial words, the performance takes a nosedive:

pg_search_scope :search_by_full_name,
  against: :searchable_full_name, # stored generated column tsvector
  using: {
    tsearch: {
      dictionary: 'english',
      tsvector_column: 'searchable_full_name',
      prefix: true
    }
  }

Benchmark.ms { User.search_by_full_name("ja").limit(5).to_a }
  User Load (3771.9ms)  SELECT "users"."id", "users"."first_name", "users"."last_name", "users"."email", "users"."type", "users"."created_at", "users"."updated_at", "users"."password_digest", "users"."trips_count", "users"."drivers_license_number", "users"."searchable_full_name" FROM "users" INNER JOIN (SELECT "users"."id" AS pg_search_id, (ts_rank(("users"."searchable_full_name"), (to_tsquery('english', ''' ' || 'ja' || ' ''' || ':*')), 0)) AS rank FROM "users" WHERE (("users"."searchable_full_name") @@ (to_tsquery('english', ''' ' || 'ja' || ' ''' || ':*')))) AS pg_search_7dfb4cf67742cb0660305e ON "users"."id" = pg_search_7dfb4cf67742cb0660305e.pg_search_id ORDER BY pg_search_7dfb4cf67742cb0660305e.rank DESC, "users"."id" ASC LIMIT $1  [["LIMIT", 5]]
 => 3779.0429999586195

At 10 million records, even with prefix: true, the performance stays pretty good. But as I ramped it up to around 25 million rows, things got pretty bad, and I'm curious if others have ideas on how to tune this for Rideshare?

@andyatkinson (Owner) commented Jan 17, 2024

Thanks for posting @jpcamara. I'm wondering, at 25mm records vs. 10mm, if it's exceeding what can be stored in the shared buffers. The query plan may also be "tipping," meaning it's doing something different, like moving from an index scan to a sequential scan. We'd want an execution plan from the query at both row counts, if you're able to set that up.

Also, I presume you're talking about "warmed up" query performance (meaning, repeatedly running the same query with the same parameters) vs. cold performance. When pages (PostgreSQL's storage unit) are accessed by queries, they're placed into shared buffers.

When we run the query again with the same parameters, it will be a "warm" query that reads from the cache, and performance is usually dramatically improved when the result comes entirely from memory. shared_buffers is a least recently used (LRU) cache, meaning older pages get evicted. When shared_buffers are accessed, we'll see higher counts for "hit" (meaning a cache "hit") vs. "read" in query execution plans. "Read" means the pages were not in shared_buffers.
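
For example, here's a minimal way to compare cold-ish vs. warm timings, reusing the Benchmark call from above. (Truly cold numbers would require restarting PostgreSQL or clearing the OS page cache, so treat the first run as only approximately cold.)

# Rough cold-vs-warm comparison: the second run should be served mostly
# from shared_buffers, assuming the relevant pages fit in the cache.
cold_ms = Benchmark.ms { User.search_by_full_name("ja").limit(5).to_a }
warm_ms = Benchmark.ms { User.search_by_full_name("ja").limit(5).to_a }
puts "cold-ish: #{cold_ms.round(1)}ms, warm: #{warm_ms.round(1)}ms"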

Check this out for more info: https://pganalyze.com/blog/5mins-explain-analyze-buffers-nested-loops

To see what's happening, can you run the query with EXPLAIN (ANALYZE, BUFFERS) prepended to it? That way we can see what the query plan is, whether it's using an index in both cases, and how many pages are coming from the shared buffer cache.

Keep in mind that for any bulk loading operation, whether you're loading 10mm or 25mm records, run ANALYZE on the table afterwards so that PostgreSQL updates its row statistics for the table.
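
For example, from the Rails console (assuming the Rideshare users table), that's a one-liner:

# Refresh planner statistics after a bulk load so row estimates are current.
ActiveRecord::Base.connection.execute("ANALYZE users;")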

pg_search can generate complex SQL queries. To get the SQL, append to_sql onto the Active Record query code. Then I usually take the SQL into psql and run it with "EXPLAIN (ANALYZE, BUFFERS)" prepended, although in newer versions of Active Record you can also run .explain(:analyze, :buffers) directly.
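
As a sketch of that workflow, using the scope and search term from above:

# Capture the SQL pg_search generates, to paste into psql after
# EXPLAIN (ANALYZE, BUFFERS):
sql = User.search_by_full_name("ja").limit(5).to_sql

# Or, on newer Rails versions, get the plan directly from Active Record:
puts User.search_by_full_name("ja").limit(5).explain(:analyze, :buffers)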

@jpcamara (Author)

@andyatkinson all great questions, many of which I should have answered in the initial issue 🤦🏼‍♂️! I will follow up with some more detailed analysis soon, thanks!

@andyatkinson (Owner)

@jpcamara Following up on this. Any thoughts?
