Degraded full text search performance on larger datasets w/ partial matching #172

jpcamara opened this issue Jan 17, 2024 · 3 comments

@jpcamara

I was experimenting with table size from the "Implementing Full Text Search (FTS)" chapter and I found that at 25 million rows + prefix: true, the queries start to take a big perf hit (3-5 seconds).

So for instance:

  • Modify db/scripts/bulk_load.sh to use FROM GENERATE_SERIES(1, 25000000) seq; instead of FROM GENERATE_SERIES(1, 10000000) seq;
  • Using the generated column:
pg_search_scope :search_by_full_name,
  against: :searchable_full_name, # stored generated column tsvector
  using: {
    tsearch: {
      dictionary: 'english',
      tsvector_column: 'searchable_full_name'
    }
  }

On this ☝🏼, the performance is pretty good. I see anywhere from 80ms to 150ms.

Benchmark.ms { User.search_by_full_name("ja").limit(5).to_a }
  User Load (72.9ms)  SELECT "users"."id", "users"."first_name", "users"."last_name", "users"."email", "users"."type", "users"."created_at", "users"."updated_at", "users"."password_digest", "users"."trips_count", "users"."drivers_license_number", "users"."searchable_full_name" FROM "users" INNER JOIN (SELECT "users"."id" AS pg_search_id, (ts_rank(("users"."searchable_full_name"), (to_tsquery('english', ''' ' || 'ja' || ' ''')), 0)) AS rank FROM "users" WHERE (("users"."searchable_full_name") @@ (to_tsquery('english', ''' ' || 'ja' || ' ''')))) AS pg_search_7dfb4cf67742cb0660305e ON "users"."id" = pg_search_7dfb4cf67742cb0660305e.pg_search_id ORDER BY pg_search_7dfb4cf67742cb0660305e.rank DESC, "users"."id" ASC LIMIT $1  [["LIMIT", 5]]
 => 114.21500006690621

But if I modify it to use prefix: true, so I can match on partial words, the performance takes a nosedive:

pg_search_scope :search_by_full_name,
  against: :searchable_full_name, # stored generated column tsvector
  using: {
    tsearch: {
      dictionary: 'english',
      tsvector_column: 'searchable_full_name',
      prefix: true
    }
  }

Benchmark.ms { User.search_by_full_name("ja").limit(5).to_a }
  User Load (3771.9ms)  SELECT "users"."id", "users"."first_name", "users"."last_name", "users"."email", "users"."type", "users"."created_at", "users"."updated_at", "users"."password_digest", "users"."trips_count", "users"."drivers_license_number", "users"."searchable_full_name" FROM "users" INNER JOIN (SELECT "users"."id" AS pg_search_id, (ts_rank(("users"."searchable_full_name"), (to_tsquery('english', ''' ' || 'ja' || ' ''' || ':*')), 0)) AS rank FROM "users" WHERE (("users"."searchable_full_name") @@ (to_tsquery('english', ''' ' || 'ja' || ' ''' || ':*')))) AS pg_search_7dfb4cf67742cb0660305e ON "users"."id" = pg_search_7dfb4cf67742cb0660305e.pg_search_id ORDER BY pg_search_7dfb4cf67742cb0660305e.rank DESC, "users"."id" ASC LIMIT $1  [["LIMIT", 5]]
 => 3779.0429999586195

At 10 million records, even with prefix: true, the performance stays pretty good. But as I ramped it up to around 25 million rows, things got pretty bad, and I'm curious if others have ideas on how to tune this for Rideshare?

@andyatkinson (Owner) commented Jan 17, 2024

Thanks for posting @jpcamara. I'm wondering, at 25mm records vs. 10mm, if it's exceeding what can be stored in the shared buffers. The query plan may also be "tipping," meaning it's doing something different, like moving from an index scan to a sequential scan. We'd want an execution plan from the query at both row counts, if you're able to set that up.

Also, I presume you're talking about "warmed up" query performance (meaning, repeatedly running the same query with the same parameters) vs. cold performance. When pages (PostgreSQL's storage unit) are accessed by queries, they're placed into shared buffers.

When we run the query again with the same parameters, it will be a "warm" query that reads from the cache, and performance is usually dramatically improved when the result comes entirely from memory. shared_buffers is a least recently used (LRU) cache, meaning older pages get evicted. When shared_buffers are accessed, we'll see higher counts for "hit" (meaning a cache "hit") vs. "read" in query execution plans. "Read" means the pages were not in shared_buffers.
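
For example, here's a minimal way to compare cold-ish vs. warm timings, reusing the Benchmark call from above. (Truly cold numbers would require restarting PostgreSQL or clearing the OS page cache, so treat the first run as only approximately cold.)

# Rough cold-vs-warm comparison: the second run should be served mostly
# from shared_buffers, assuming the relevant pages fit in the cache.
cold_ms = Benchmark.ms { User.search_by_full_name("ja").limit(5).to_a }
warm_ms = Benchmark.ms { User.search_by_full_name("ja").limit(5).to_a }
puts "cold-ish: #{cold_ms.round(1)}ms, warm: #{warm_ms.round(1)}ms"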

Check this out for more info: https://pganalyze.com/blog/5mins-explain-analyze-buffers-nested-loops

To see what's happening, can you run the query with EXPLAIN (ANALYZE, BUFFERS) prepended to it? That way we can see what the query plan is, whether it's using an index in both cases, and how many pages are coming from the shared buffer cache.

Keep in mind that for any bulk loading operation, whether you're loading 10mm or 25mm records, run ANALYZE on the table afterwards so that PostgreSQL updates its row statistics for the table.
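
For example, from the Rails console (assuming the Rideshare users table), that's a one-liner:

# Refresh planner statistics after a bulk load so row estimates are current.
ActiveRecord::Base.connection.execute("ANALYZE users;")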

pg_search can generate complex SQL queries. To get the SQL, append to_sql onto the Active Record query code. Then I usually take the SQL into psql and run it with "EXPLAIN (ANALYZE, BUFFERS)" prepended, although in newer versions of Active Record you can also run .explain(:analyze, :buffers) directly.
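
As a sketch of that workflow, using the scope and search term from above:

# Capture the SQL pg_search generates, to paste into psql after
# EXPLAIN (ANALYZE, BUFFERS):
sql = User.search_by_full_name("ja").limit(5).to_sql

# Or, on newer Rails versions, get the plan directly from Active Record:
puts User.search_by_full_name("ja").limit(5).explain(:analyze, :buffers)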

@jpcamara (Author)

@andyatkinson all great questions, many of which I should have answered in the initial issue 🤦🏼‍♂️! I will follow up with some more detailed analysis soon, thanks!

@andyatkinson (Owner)

@jpcamara Following up on this. Any thoughts?
