Thread: Helping planner to chose sequential scan when it improves performance
Hi all,
I recently started at a new firm and have been trying to help to grok certain planner behavior. A strip-down example of the sort of join we do in the database looks like this, wherein we join two tables that have about 1 million rows:
-- VACUUM (FULL, VERBOSE, ANALYZE), run the query twice first, then...
EXPLAIN(ANALYZE, VERBOSE, COSTS, SETTINGS, BUFFERS, WAL, TIMING, SUMMARY)
SELECT
ci.conversation_uuid,
ci.item_uuid,
ci.tenant_id,
it.item_subject,
it.item_body_start
FROM
conversation_item AS ci
INNER JOIN item_text AS it ON it.item_uuid = ci.item_uuid;
SELECT
ci.conversation_uuid,
ci.item_uuid,
ci.tenant_id,
it.item_subject,
it.item_body_start
FROM
conversation_item AS ci
INNER JOIN item_text AS it ON it.item_uuid = ci.item_uuid;
-- The necessary DDL that creates these tables and indexes is attached. I've commented out some extra stuff that isn't directly related to the above query.
Depending on config, we get different results in terms of performance (EXPLAIN output attached):
PLAN A (default config, effective cache size just shy of 15GB): 3.829 seconds. A nested loop is used to probe the hash index `conversation_item_item_hash_index` for each row of item_text. Although the cost of probing once is low, a fair amount of time passes because the operation is repeated ~1.3 million times.
PLAN B (enable_indexscan off, effective cache same as before): 3.254 seconds (~15% speedup, sometimes 30%). Both tables are scanned sequentially and conversation_item is hashed before results are combined with a hash join.
PLAN C: (random_page_cost = 8.0, instead of default 4, effective cache same as before): 2.959 (~23% speedup, sometimes 38%). Same overall plan as PLAN B, some differences in buffers and I/O. I'll note we had to get to 8.0 before we saw a change to planner behavior; 5.0, 6.0, and 7.0 were too low to make a difference.
Environment:
Postgres 15.2
Amazon RDS — db.m6g.2xlarge
- In Plan A, what factors are causing the planner to select a substantially slower plan despite having recent stats about number of rows?
- Is there a substantial difference between the on-the-fly hash done in Plan B and Plan C compared to the hash-index used in Plan A? Can I assume they are essentially the same? Perhaps there are there differences in how they're applied?
- Is it common to see values for random_page_cost set as high as 8.0? We would of course need to investigate whether we see a net positive or net negative impact on other queries, to adopt this as a general setting, but is it a proposal we should actually consider?
- Maybe we are barking up the wrong tree with the previous questions. Are there other configuration parameters we should consider first to improve performance in situations like the one illustrated?
- Are there other problems with our schema, query, or plans shown here? Other approaches (or tools/analyses) we should consider?
Attachment
On Tue, Jun 13, 2023 at 10:28 PM Patrick O'Toole <patrick.otoole@sturdy.ai> wrote:
Hi all,
Questions:
- In Plan A, what factors are causing the planner to select a substantially slower plan despite having recent stats about number of rows?
Estimated overall cost. For Plan A it is ~200k. For plans B/C (haven't noticed any differences in these two) it is ~250k. The planner uses a less expensive plan.
Also, in the plans you can see that Pg estimates the number of rows correctly.
- Is there a substantial difference between the on-the-fly hash done in Plan B and Plan C compared to the hash-index used in Plan A? Can I assume they are essentially the same? Perhaps there are there differences in how they're applied?
I don't see any difference in plans B and C, but you report timing changes. To me this looks like just a fluctuation in measurements. So I wouldn't trust any measurements for plan A either.
I'm not a big expert, but can not say that plan A and B are essentially the same.
Plan A: DB scans item_text table and for every record looks into the index of conversation_item table, then looks into the table itself.
Plan B/C: DB scans conversation_item table without looking into its indexes building a hash table on the fly.
- Is it common to see values for random_page_cost set as high as 8.0? We would of course need to investigate whether we see a net positive or net negative impact on other queries, to adopt this as a general setting, but is it a proposal we should actually consider?
No idea.
- Maybe we are barking up the wrong tree with the previous questions. Are there other configuration parameters we should consider first to improve performance in situations like the one illustrated?
Recheck your numbers.
- Are there other problems with our schema, query, or plans shown here? Other approaches (or tools/analyses) we should consider?
You can try the following index:
I believe this index would allow Pg to use "index only scan" as variation of Plan A and avoid touching the conversation_item table completely.
Best regards, Ruslan.
On Tue, Jun 13, 2023 at 3:28 PM Patrick O'Toole <patrick.otoole@sturdy.ai> wrote:
> run the query twice first, then...
Is that a realistic way to run the test? Often forcing all the data needed for the query into memory is going to make things less realistic, not more realistic. Assuming the system has more stuff to do than just perform this one query, it might be unusual for the query to find everything it needs in memory. Also, if you really do want to do it this way, then you should do this for every plan. Different plans need to access a different collections of buffers, so prewarming just one plan will privilege that one over the others.
PLAN A (default config, effective cache size just shy of 15GB): 3.829 seconds. A nested loop is used to probe the hash index `conversation_item_item_hash_index` for each row of item_text. Although the cost of probing once is low, a fair amount of time passes because the operation is repeated ~1.3 million times.PLAN B (enable_indexscan off, effective cache same as before): 3.254 seconds (~15% speedup, sometimes 30%). Both tables are scanned sequentially and conversation_item is hashed before results are combined with a hash join.PLAN C: (random_page_cost = 8.0, instead of default 4, effective cache same as before): 2.959 (~23% speedup, sometimes 38%). Same overall plan as PLAN B, some differences in buffers and I/O. I'll note we had to get to 8.0 before we saw a change to planner behavior; 5.0, 6.0, and 7.0 were too low to make a difference.
The difference between B and C looks like it is entirely noise, having to do with how many buffers it found already in the cache and how many of them needed cleaning (which causes the buffer to be dirty as the cleaned version now needs to be written to disk) and how many dirty buffers it found that needed to be written in order to make way to read other buffers it needs. (This last number most generally reflects dirty buffers left around by other things which this query encountered, not the buffers the query itself dirtied). None of this is likely to be reproducible, and so not worth investigating.
And the difference between A and BC is small enough that it is unlikely to be worth pursuing, either, even if it is reproducible. If your apps runs this one exact query often enough that a 30% difference is worth worrying about, you would probably be better served by questioning the business case. What are you doing with 1.4 million rows once you do fetch them, that it needs to be repeated so often?
If you think that taking a deep dive into this one query is going to deliver knowledge which will pay off for other (so far unexamined) queries, I suspect you are wrong. Look for queries where the misestimation is more stark than 30% to serve as your case studies.
Environment:Postgres 15.2Amazon RDS — db.m6g.2xlargeQuestions:
In Plan A, what factors are causing the planner to select a substantially slower plan despite having recent stats about number of rows?
Even if it were worth trying to answer this (which I think it is not), there isn't much we can do with dummy tables containing no data. You would need to include a script to generate data of a size and distribution which reproduces the given behavior.
> Is there a substantial difference between the on-the-fly hash done in Plan B and Plan C compared to the hash-index used in Plan A? Can I assume they are essentially the same? Perhaps there are there differences in how they're applied?
They are pretty much entirely different. Once jumps all over the index on disk, the other reads the table sequentially and (due to work_mem) parcels it out into chunks where it expects each chunk can also be read back in sequentially as well. About the only thing not different is that they both involve computing a hash function.
> Is it common to see values for random_page_cost set as high as 8.0? We would of course need to investigate whether we see a net positive or net negative impact on other queries, to adopt this as a general setting, but is it a proposal we should actually consider?
I've never needed to set it that high, but there is no a priori reason it wouldn't make sense to do. Settings that high would probably only be suitable for HDD (rather than SSD) storage and when caching is not very effective, which does seem to be the opposite of your situation. So I certainly wouldn't do it just based on the evidence at hand.
Cheers,
Jeff
On Wed, 14 Jun 2023 at 07:28, Patrick O'Toole <patrick.otoole@sturdy.ai> wrote: > Maybe we are barking up the wrong tree with the previous questions. Are there other configuration parameters we shouldconsider first to improve performance in situations like the one illustrated? random_page_cost and effective_cache_size are the main settings which will influence plan A vs plan B. Larger values of effective_cache_size will have the planner apply more seq_page_costs to the index scan. Lower values of effective_cache_size will mean more pages will be assumed to cost random_page_cost. David
On Sun, Jun 25, 2023 at 3:48 PM David Rowley <dgrowleyml@gmail.com> wrote:
On Wed, 14 Jun 2023 at 07:28, Patrick O'Toole <patrick.otoole@sturdy.ai> wrote:
> Maybe we are barking up the wrong tree with the previous questions. Are there other configuration parameters we should consider first to improve performance in situations like the one illustrated?
random_page_cost and effective_cache_size are the main settings which
will influence plan A vs plan B. Larger values of
effective_cache_size will have the planner apply more seq_page_costs
to the index scan.
Squeezing otherwise-random page costs towards seq_page_costs is what bitmap scans do, and what large index scans with high pg_stats.correlation do. But effective_cache_size does something else, it squeezes the per page costs towards zero, not towards seq_page_costs. This is surely not accurate, as the costs of locking the buffer mapping partition, finding the buffer or reading it from the kernel cache if not found, maybe faulting the buffer from main memory into on-CPU memory, pinning the buffer, and read-locking it are certainly well above zero, even if not nearly as high as seq_page_cost. I'd guess they are truly about 2 to 5 times a cpu_tuple_cost per buffer. But zero is what they currently get, there is no knob to twist to change that.
Lower values of effective_cache_size will mean
more pages will be assumed to cost random_page_cost.
Sure, but it addresses the issue only obliquely (as does raising random_page_cost) not directly. So the change you need to make to them will be large, and will likely make other things worse.
Cheers,
Jeff