Re: Postgres Query Plan using wrong index - Mailing list pgsql-general
From | Manikandan Swaminathan |
---|---|
Subject | Re: Postgres Query Plan using wrong index |
Date | |
Msg-id | 2BC8AB39-16D7-4423-BE0A-F0F4EA432E2E@gmail.com Whole thread Raw |
In response to | Re: Postgres Query Plan using wrong index (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: Postgres Query Plan using wrong index
|
List | pgsql-general |
Thanks Tom. Since you mentioned the planner not knowing about the correlation between the columns, I’m curious, why doesn’t making amultivariate statistic make a difference? CREATE STATISTICS col_a_col_b_stats (dependencies) ON col_a, col_b FROM test_table; ANALYZE test_table; And the resulting query plan which uses just the index on col_b: postgres=# explain analyze select min(col_b) from test_table where col_a > 4996; Result (cost=62.13..62.14 rows=1 width=4) (actual time=536.648..536.649 rows=1 loops=1) InitPlan 1 -> Limit (cost=0.43..62.13 rows=1 width=4) (actual time=536.641..536.641 rows=1 loops=1) -> Index Scan using idx_col_a_btree on test_table (cost=0.43..254987.43 rows=4133 width=4) (actual time=536.640..536.640rows=1 loops=1) Filter: (col_a > 4996) Rows Removed by Filter: 9992000 Planning Time: 0.285 ms Execution Time: 536.681 ms > > On Apr 2, 2025, at 5:30 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Manikandan Swaminathan <maniswami23@gmail.com> writes: >> 1) Why is the query currently picking the poorly performing index? > > Because the planner thinks that one will be cheaper, as you can see by > comparing the cost estimates in EXPLAIN. It's wrong, but this is a > hard problem to estimate well. Especially when the behavior depends > on a correlation between columns that the planner knows nothing about. > >> 2) Why would the index you suggested, (col_b, col_a), perform better than (col_a, col_b)? I would’ve expected the filteron col_a to come first, followed by the aggregate on col_b. In my mind, it needs to find rows matching the col_a conditionbefore calculating the MIN(col_b), and I assumed it would traverse the B-tree accordingly. > > The idea the planner is using is "scan the index in order (that is, > in col_b order) until you find the first row satisfying the other > constraints (that is, the col_a condition). Then that row's col_b > value is the correct MIN(), and you can stop." Since it knows nothing > of the cross-column correlation, its estimate of how many rows it'll > have to scan through is overly optimistic. But it knows that the > other way involves scanning a whole lot of the index --- there's no > chance of stopping early --- so that's estimated as higher-cost. > > The index I suggested on (col_b, col_a) is amenable to this same > plan shape, since col_b is still the major sort column. The > reason it wins is that the col_a condition can be checked in the > index without having to visit the heap, thus eliminating a lot of > random access to the heap. > >> 3) Why does the planner choose the better-performing (col_a, col_b) index when the filter is col_a > 5000, but switchto the slower (col_b) index when the filter is not at the edge of the range, like col_a > 4996? > > At some point, as less and less of the col_a-major index would need to > be scanned, there's a crossover in the cost estimates for the two ways > of doing this. I would not have cared to predict where the crossover > is, but you evidently found it empirically. > >> For reference, here’s the query plan when filtering for col_a > 5000. It uses the correct index on (col_a, col_b). > > You would do a lot better to approach this without rigid notions of > which is the "correct" index. All of the ones we've discussed are > potentially usable for this query, and they all have different cost > curves depending on how selective the col_a condition is. Even the > index on col_b alone could potentially be the best, because it'll be > smaller than the two-column indexes. So if the col_a condition is > very unselective then it's (at least in theory) possible that that > would be the best choice. > > regards, tom lane
pgsql-general by date: