Thread: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions
From
Dimitrios Apostolou
Date:
Hello list,

INTRO

I have a huge (multi-billion rows) table partitioned into 1000 partitions. Around half of the partitions are full and the rest are empty, created in advance ready to receive future incoming data. Postgres is 16.2. Here are the relevant parts of the schema:

> \d test_runs_raw
              Partitioned table "public.test_runs_raw"
       Column       |   Type   | Collation | Nullable |             Default
 -------------------+----------+-----------+----------+----------------------------------
  run_n             | bigint   |           | not null | generated by default as identity
  test_case_n       | smallint |           | not null |
  workitem_n        | integer  |           | not null |
  test_resulttype_n | smallint |           |          |
 Partition key: RANGE (workitem_n)
 Indexes:
     "test_runs_raw_partitioned_pkey" PRIMARY KEY, btree (workitem_n, run_n)

Each partition is made to keep entries with workitem_n in ranges (0,20k), (20k,40k) and so on (k = kilo) up to 20000k.

PROBLEM

I noticed that the following query is very very slow (too long to wait for it to finish):

SELECT DISTINCT workitem_n FROM test_runs_raw ORDER BY workitem_n DESC LIMIT 10;

What is remarkable is that in 998 out of the 1000 table scans it involves, the planner does not use the index. Instead it chooses a sequential scan.
Here is the output from EXPLAIN:

 Limit  (cost=853891608.79..853891608.99 rows=10 width=4)
   ->  Unique  (cost=853891608.79..853891612.79 rows=200 width=4)
         ->  Sort  (cost=853891608.79..853891610.79 rows=800 width=4)
               Sort Key: test_runs_raw.workitem_n DESC
               ->  Gather  (cost=853891488.22..853891570.22 rows=800 width=4)
                     Workers Planned: 4
                     ->  HashAggregate  (cost=853890488.22..853890490.22 rows=200 width=4)
                           Group Key: test_runs_raw.workitem_n
                           ->  Parallel Append  (cost=0.00..813118117.30 rows=16308948365 width=4)
                                 ->  Parallel Index Only Scan Backward using test_runs_raw__part_max9600k_pkey on test_runs_raw__part_max9600k test_runs_raw_480  (cost=0.57..1597355.10 rows=33623320 width=4)
                                 ->  Parallel Index Only Scan Backward using test_runs_raw__part_max10140k_pkey on test_runs_raw__part_max10140k test_runs_raw_507  (cost=0.57..1210795.63 rows=25793672 width=4)
                                 ->  Parallel Seq Scan on test_runs_raw__part_max9500k test_runs_raw_475  (cost=0.00..3037793.12 rows=64121612 width=4)
                                 ->  Parallel Seq Scan on test_runs_raw__part_max11180k test_runs_raw_559  (cost=0.00..2918875.90 rows=61612190 width=4)
                                 [ ... 996 more sequential scans ... ]

If I remove DISTINCT then the plan changes dramatically and it runs instantaneously:

 Limit  (cost=363.84..367.30 rows=10 width=4)
   ->  Append  (cost=363.84..22527480551.58 rows=65235793929 width=4)
         ->  Index Only Scan Backward using test_runs_raw__part_max20000k_pkey on test_runs_raw__part_max20000k test_runs_raw_1000  (cost=0.12..2.34 rows=1 width=4)
         ->  Index Only Scan Backward using test_runs_raw__part_max19980k_pkey on test_runs_raw__part_max19980k test_runs_raw_999  (cost=0.12..2.34 rows=1 width=4)
         ->  Index Only Scan Backward using test_runs_raw__part_max19960k_pkey on test_runs_raw__part_max19960k test_runs_raw_998  (cost=0.12..2.34 rows=1 width=4)
         ->  Index Only Scan Backward using test_runs_raw__part_max19940k_pkey on test_runs_raw__part_max19940k test_runs_raw_997  (cost=0.12..2.34 rows=1 width=4)
         [ ... 996 more index scans ... ]

Notice how in the last plan there is no parallel scanning. Instead the partitions are scanned sequentially, *in proper order*, so that the plan execution stops after reading the first 10 rows in the first non-empty partition. Why can't the same be done with DISTINCT?

Please note that the workitem_n values are well spread over the range (0,13M) and the table has been gradually filled within one year, so I'm assuming the vacuum worker has worked long enough to build sane statistics (not sure how to verify that).

REMARKS

1. I tried reproducing the problem on an artificial table with few partitions and few values, but I couldn't. Both queries execute fast, and the planner always chooses a non-parallel index-only scan.

2. While testing changes to various settings, I noticed that setting max_parallel_workers_per_gather to 0 (from the original value of 4) fixes the issue! On the original huge table, disabling parallelism actually makes the query infinitely faster and it returns within 1s!

Is this a bug in the planner?

Thank you,
Dimitris
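For reference, the workaround described in remark 2 can be applied to a single session only, so the rest of the workload keeps its parallel plans. A minimal sketch, using the table from this thread:

```sql
-- Disable parallel plans for this session only, as a workaround;
-- other sessions keep the configured max_parallel_workers_per_gather.
SET max_parallel_workers_per_gather = 0;

SELECT DISTINCT workitem_n
FROM test_runs_raw
ORDER BY workitem_n DESC
LIMIT 10;

-- Restore the configured value afterwards.
RESET max_parallel_workers_per_gather;
```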
Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions
From
Dimitrios Apostolou
Date:
On Fri, 10 May 2024, Dimitrios Apostolou wrote:

> I noticed that the following query is very very slow (too long to wait for it
> to finish):
>
> SELECT DISTINCT workitem_n FROM test_runs_raw ORDER BY workitem_n DESC LIMIT 10;

Update: even the simplest SELECT DISTINCT query shows similar behaviour:

EXPLAIN SELECT DISTINCT workitem_n FROM test_runs_raw LIMIT 10;

 Limit  (cost=724518979.52..724518979.92 rows=10 width=4)
   ->  Unique  (cost=724518979.52..724518987.52 rows=200 width=4)
         ->  Sort  (cost=724518979.52..724518983.52 rows=1600 width=4)
               Sort Key: test_runs_raw.workitem_n
               ->  Gather  (cost=724518732.37..724518894.37 rows=1600 width=4)
                     Workers Planned: 4
                     ->  HashAggregate  (cost=724517732.37..724517734.37 rows=200 width=4)
                           Group Key: test_runs_raw.workitem_n
                           ->  Parallel Append  (cost=0.00..704131546.90 rows=8154474186 width=4)
                                 ->  Parallel Index Only Scan using test_runs_raw__part_max9600k_pkey on test_runs_raw__part_max9600k test_runs_raw_480  (cost=0.57..1429238.50 rows=16811660 width=4)
                                 ->  Parallel Index Only Scan using test_runs_raw__part_max10140k_pkey on test_runs_raw__part_max10140k test_runs_raw_507  (cost=0.57..1081827.27 rows=12896836 width=4)
                                 ->  Parallel Seq Scan on test_runs_raw__part_max9500k test_runs_raw_475  (cost=0.00..2717185.06 rows=32060806 width=4)
                                 ->  Parallel Seq Scan on test_runs_raw__part_max11180k test_runs_raw_559  (cost=0.00..2610814.95 rows=30806095 width=4)

It also takes ages to return, so I have to interrupt it. I believe it should exit early, as soon as it finds 10 distinct values (which should be rather easy even with parallel seqscans, given the pattern followed when inserting the data).

Thanks,
Dimitris
Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions
From
Dimitrios Apostolou
Date:
On Fri, 10 May 2024, Dimitrios Apostolou wrote:

> On Fri, 10 May 2024, Dimitrios Apostolou wrote:
>> Update: even the simplest SELECT DISTINCT query shows similar behaviour:

Further digging into this simple query, if I force the non-parallel plan by setting max_parallel_workers_per_gather TO 0, I see that the query planner comes up with a much higher cost:

 Limit  (cost=363.84..1134528847.47 rows=10 width=4)
   ->  Unique  (cost=363.84..22690570036.41 rows=200 width=4)
         ->  Append  (cost=363.84..22527480551.58 rows=65235793929 width=4)
               ->  Index Only Scan using test_runs_raw__part_max20k_pkey on test_runs_raw__part_max20k test_runs_raw_1  (cost=0.12..2.34 rows=1 width=4)
               ->  Index Only Scan using test_runs_raw__part_max40k_pkey on test_runs_raw__part_max40k test_runs_raw_2  (cost=0.12..2.34 rows=1 width=4)
               [...]
               ->  Index Only Scan using test_runs_raw__part_max1780k_pkey on test_runs_raw__part_max1780k test_runs_raw_89  (cost=0.57..53587294.65 rows=106088160 width=4)
               ->  Index Only Scan using test_runs_raw__part_max1800k_pkey on test_runs_raw__part_max1800k test_runs_raw_90  (cost=0.57..98943539.74 rows=96214080 width=4)
               ->  Index Only Scan using test_runs_raw__part_max1820k_pkey on test_runs_raw__part_max1820k test_runs_raw_91  (cost=0.57..97495653.34 rows=193248960 width=4)
               ->  Index Only Scan using test_runs_raw__part_max1840k_pkey on test_runs_raw__part_max1840k test_runs_raw_92  (cost=0.57..110205205.07 rows=218440928 width=4)
               ->  Index Only Scan using test_runs_raw__part_max1860k_pkey on test_runs_raw__part_max1860k test_runs_raw_93  (cost=0.57..50164056.28 rows=99431760 width=4)
               [...]

The total cost on the 1st line (cost=363.84..1134528847.47) has a much higher upper limit than the total cost when max_parallel_workers_per_gather is 4 (cost=853891608.79..853891608.99). This explains the planner's choice. But I wonder why the cost estimation is so far away from reality.

Dimitris
Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions
From
Tom Lane
Date:
Dimitrios Apostolou <jimis@gmx.net> writes: > Further digging into this simple query, if I force the non-parallel plan > by setting max_parallel_workers_per_gather TO 0, I see that the query > planner comes up with a cost much higher: > Limit (cost=363.84..1134528847.47 rows=10 width=4) > -> Unique (cost=363.84..22690570036.41 rows=200 width=4) > -> Append (cost=363.84..22527480551.58 rows=65235793929 width=4) > ... > The total cost on the 1st line (cost=363.84..1134528847.47) has a much > higher upper limit than the total cost when > max_parallel_workers_per_gather is 4 (cost=853891608.79..853891608.99). > This explains the planner's choice. But I wonder why the cost estimation > is so far away from reality. I'd say the blame lies with that (probably-default) estimate of just 200 distinct rows. That means the planner expects to have to read about 5% (10/200) of the tables to get the result, and that's making fast-start plans look bad. Possibly an explicit ANALYZE on the partitioned table would help. regards, tom lane
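The suggested command, spelled out (VERBOSE is optional, but it makes the per-partition progress visible, which helps on a 1000-partition table):

```sql
-- Collect statistics on the partitioned table; this samples each
-- partition and also builds "inherited" statistics for the parent,
-- which is what the DISTINCT estimate needs.
ANALYZE VERBOSE test_runs_raw;
```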
Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions
From
Dimitrios Apostolou
Date:
On Fri, 10 May 2024, Tom Lane wrote:

> Dimitrios Apostolou <jimis@gmx.net> writes:
>> Further digging into this simple query, if I force the non-parallel plan
>> by setting max_parallel_workers_per_gather TO 0, I see that the query
>> planner comes up with a cost much higher:
>
>> Limit (cost=363.84..1134528847.47 rows=10 width=4)
>>   -> Unique (cost=363.84..22690570036.41 rows=200 width=4)
>>     -> Append (cost=363.84..22527480551.58 rows=65235793929 width=4)
>> ...
>
>> The total cost on the 1st line (cost=363.84..1134528847.47) has a much
>> higher upper limit than the total cost when
>> max_parallel_workers_per_gather is 4 (cost=853891608.79..853891608.99).
>> This explains the planner's choice. But I wonder why the cost estimation
>> is so far away from reality.
>
> I'd say the blame lies with that (probably-default) estimate of
> just 200 distinct rows. That means the planner expects to have
> to read about 5% (10/200) of the tables to get the result, and
> that's making fast-start plans look bad.

Indeed that's an awful estimate, the table has more than 1M unique values in that column. Looking into pg_stat_user_tables, I can't see the partitions having been vacuum'd or analyzed at all. I think they should have been auto-analyzed, since they get a ton of INSERTs (no deletes/updates though) and I have the default autovacuum settings. Could it be that autovacuum starts but never finishes? I can't find anything in the logs.

In any case, even after the planner decides to execute the terrible plan with the parallel seqscans, why doesn't it finish right when it finds 10 distinct values?

> Possibly an explicit ANALYZE on the partitioned table would help.

Thanks, I'll save the ANALYZE as the last step; I feel it's a good opportunity to figure out more details about how postgres works. Plus I expect ANALYZE to last a couple of days, so I should first find quiet time for that. :-)

Dimitris
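For anyone wanting to run the same check, a query along these lines shows when each partition was last (auto)vacuumed and (auto)analyzed; all columns come from the standard pg_stat_user_tables view:

```sql
-- When did (auto)vacuum / (auto)analyze last touch each partition,
-- and how many rows have been modified since the last analyze?
SELECT relname, last_vacuum, last_autovacuum,
       last_analyze, last_autoanalyze, n_mod_since_analyze
FROM pg_stat_user_tables
WHERE relname LIKE 'test_runs_raw%'
ORDER BY relname;
```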
Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions
From
Tom Lane
Date:
Dimitrios Apostolou <jimis@gmx.net> writes: > On Fri, 10 May 2024, Tom Lane wrote: >> I'd say the blame lies with that (probably-default) estimate of >> just 200 distinct rows. That means the planner expects to have >> to read about 5% (10/200) of the tables to get the result, and >> that's making fast-start plans look bad. > In any case, even after the planner decides to execute the terrible plan > with the parallel seqscans, why doesn't it finish right when it finds 10 > distinct values? That plan can't emit anything at all till it finishes the Sort. I do kind of wonder why it's producing both a hashagg and a Unique step --- seems like it should do one or the other. > Thanks, I'll save the ANALYZE as the last step; I feel it's a good > opportunity to figure out more details about how postgres works. Plus I > expect ANALYZE to last a couple of days, so I should first find quiet time > for that. :-) It really should not take too long --- it reads a sample, not the whole table. regards, tom lane
Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions
From
David Rowley
Date:
On Sat, 11 May 2024 at 13:11, Dimitrios Apostolou <jimis@gmx.net> wrote: > Indeed that's an awful estimate, the table has more than 1M of unique > values in that column. Looking into pg_stat_user_tables, I can't see the > partitions having been vacuum'd or analyzed at all. I think they should > have been auto-analyzed, since they get a ton of INSERTs > (no deletes/updates though) and I have the default autovacuum settings. > Could it be that autovacuum starts, but never > finishes? I can't find something in the logs. It's not the partitions getting analyzed you need to worry about for an ndistinct estimate on the partitioned table. It's auto-analyze or ANALYZE on the partitioned table itself that you should care about. If you look at [1], it says "Tuples changed in partitions and inheritance children do not trigger analyze on the parent table." > In any case, even after the planner decides to execute the terrible plan > with the parallel seqscans, why doesn't it finish right when it finds 10 > distinct values? It will. It's just that Sorting requires fetching everything from its subnode. David [1] https://www.postgresql.org/docs/16/routine-vacuuming.html#VACUUM-FOR-STATISTICS
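To spell out the distinction: the parent's "inherited" statistics are built only when the partitioned table itself is analyzed, and they show up in pg_stats with inherited = true. A sketch:

```sql
-- Analyzing the parent samples the partitions AND builds
-- whole-tree ("inherited") statistics for the parent itself.
ANALYZE test_runs_raw;

-- The inherited n_distinct estimate the planner uses for the
-- partitioned table appears with inherited = true.
SELECT inherited, n_distinct
FROM pg_stats
WHERE tablename = 'test_runs_raw'
  AND attname = 'workitem_n';
```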
Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions
From
David Rowley
Date:
On Sat, 11 May 2024 at 13:33, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I do kind of wonder why it's producing both a hashagg and a Unique > step --- seems like it should do one or the other. It still needs to make the duplicate groups from parallel workers unique. David
Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions
From
Dimitrios Apostolou
Date:
On Sat, 11 May 2024, David Rowley wrote: > On Sat, 11 May 2024 at 13:33, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I do kind of wonder why it's producing both a hashagg and a Unique >> step --- seems like it should do one or the other. > > It still needs to make the duplicate groups from parallel workers unique. Range partitioning of the table guarantees that, since the ranges are not overlapping. Dimitris
Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions
From
Dimitrios Apostolou
Date:
On Sat, 11 May 2024, David Rowley wrote:

> On Sat, 11 May 2024 at 13:11, Dimitrios Apostolou <jimis@gmx.net> wrote:
>> Indeed that's an awful estimate, the table has more than 1M of unique
>> values in that column. Looking into pg_stat_user_tables, I can't see the
>> partitions having been vacuum'd or analyzed at all. I think they should
>> have been auto-analyzed, since they get a ton of INSERTs
>> (no deletes/updates though) and I have the default autovacuum settings.
>> Could it be that autovacuum starts, but never
>> finishes? I can't find something in the logs.
>
> It's not the partitions getting analyzed you need to worry about for
> an ndistinct estimate on the partitioned table. It's auto-analyze or
> ANALYZE on the partitioned table itself that you should care about.
>
> If you look at [1], it says "Tuples changed in partitions and
> inheritance children do not trigger analyze on the parent table."

Thanks

>> In any case, even after the planner decides to execute the terrible plan
>> with the parallel seqscans, why doesn't it finish right when it finds 10
>> distinct values?
>
> It will. It's just that Sorting requires fetching everything from its subnode.

Isn't it plain wrong to have a sort step in the plan then? The different partitions contain different value ranges with no overlap, and the last query I posted doesn't even contain an ORDER BY clause, just a DISTINCT clause on an indexed column.

Even with bad estimates, even with seq scan instead of index scan, the plan should be such that it concludes all parallel work as soon as it finds the 10 distinct values. And this is actually achieved if I disable parallel plans. Could it be a bug in the parallel plan generation?

Dimitris
Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions
From
Dimitrios Apostolou
Date:
On Mon, 13 May 2024, Dimitrios Apostolou wrote:

> On Sat, 11 May 2024, David Rowley wrote:
>
>> On Sat, 11 May 2024 at 13:11, Dimitrios Apostolou <jimis@gmx.net> wrote:
>>> Indeed that's an awful estimate, the table has more than 1M of unique
>>> values in that column. Looking into pg_stat_user_tables, I can't see the
>>> partitions having been vacuum'd or analyzed at all. I think they should
>>> have been auto-analyzed, since they get a ton of INSERTs
>>> (no deletes/updates though) and I have the default autovacuum settings.
>>> Could it be that autovacuum starts, but never
>>> finishes? I can't find something in the logs.
>>
>> It's not the partitions getting analyzed you need to worry about for
>> an ndistinct estimate on the partitioned table. It's auto-analyze or
>> ANALYZE on the partitioned table itself that you should care about.
>>
>> If you look at [1], it says "Tuples changed in partitions and
>> inheritance children do not trigger analyze on the parent table."
>
> Thanks

Do I read that correctly, that I have to set up cron jobs to manually analyze partitioned tables?

Dimitris
Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions
From
David Rowley
Date:
On Tue, 14 May 2024 at 00:28, Dimitrios Apostolou <jimis@gmx.net> wrote: > > On Sat, 11 May 2024, David Rowley wrote: > > > On Sat, 11 May 2024 at 13:33, Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> I do kind of wonder why it's producing both a hashagg and a Unique > >> step --- seems like it should do one or the other. > > > > It still needs to make the duplicate groups from parallel workers unique. > > Range partitioning of the table guarantees that, since the ranges are not > overlapping. That assumes the Append won't ever use > 1 worker per subnode, but that's not the case for your plan as the subnodes are "Parallel". That means all the workers could be working on the same subnode which could result in one group being split between 2 or more workers. Parallel Append can also run in a way that the Append child nodes will only get 1 worker each. However, even if that were the case for your plan, we have no code that would skip the final aggregate phase when the DISTINCT / GROUP contains all of the partition key columns. David
Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions
From
David Rowley
Date:
On Tue, 14 May 2024 at 00:41, Dimitrios Apostolou <jimis@gmx.net> wrote:
>
> On Sat, 11 May 2024, David Rowley wrote:
>> It will. It's just that Sorting requires fetching everything from its subnode.
>
> Isn't it plain wrong to have a sort step in the plan then? The different
> partitions contain different value ranges with no overlap, and the last
> query I posted doesn't even contain an ORDER BY clause, just a DISTINCT
> clause on an indexed column.

The query does contain an ORDER BY, so if the index is not chosen to provide pre-sorted input, then something has to put the results in the correct order before the LIMIT is applied.

> Even with bad estimates, even with seq scan instead of index scan, the
> plan should be such that it concludes all parallel work as soon as it
> finds the 10 distinct values. And this is actually achieved if I disable
> parallel plans. Could it be a bug in the parallel plan generation?

If you were to put the n_distinct_inherited estimate back to 200 and disable sort, you should see the costs are higher for the index plan. If that's not the case then there might be a bug. It seems more likely that due to the n_distinct estimate being so low, the planner thought that a large enough fraction of the rows needed to be read and that made the non-index plan appear cheaper.

I'd be interested in seeing what the costs are for the index plan. I think the following will give you that (untested):

alter table test_runs_raw alter column workitem_n set (n_distinct_inherited=200);
analyze test_runs_raw;
set enable_sort=0;
explain SELECT DISTINCT workitem_n FROM test_runs_raw ORDER BY workitem_n DESC LIMIT 10;

-- undo
alter table test_runs_raw alter column workitem_n set (n_distinct_inherited=-1);
reset enable_sort;

David
Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions
From
David Rowley
Date:
On Tue, 14 May 2024 at 00:46, Dimitrios Apostolou <jimis@gmx.net> wrote: > > On Mon, 13 May 2024, Dimitrios Apostolou wrote: > > > On Sat, 11 May 2024, David Rowley wrote: > >> If you look at [1], it says "Tuples changed in partitions and > >> inheritance children do not trigger analyze on the parent table." > > > Do I read that correctly, that I have to setup cron jobs to manually > analyze partitioned tables? It means that auto-analyze won't touch it. Periodically doing an ANALYZE on the partitioned table is probably a good idea. David
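One way to schedule that periodic ANALYZE from inside the database, assuming the pg_cron extension is installed (a plain OS cron job running psql works just as well):

```sql
-- Assumes the pg_cron extension is available; re-analyze the
-- partitioned table every night at 03:00 to keep the parent's
-- inherited statistics fresh.
SELECT cron.schedule('analyze-test-runs-raw', '0 3 * * *',
                     'ANALYZE test_runs_raw');
```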
Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions
From
Dimitrios Apostolou
Date:
On Tue, 14 May 2024, David Rowley wrote: > On Tue, 14 May 2024 at 00:41, Dimitrios Apostolou <jimis@gmx.net> wrote: >> >> On Sat, 11 May 2024, David Rowley wrote: >>> It will. It's just that Sorting requires fetching everything from its subnode. >> >> Isn't it plain wrong to have a sort step in the plan than? The different >> partitions contain different value ranges with no overlap, and the last >> query I posted doesn't even contain an ORDER BY clause, just a DISTINCT >> clause on an indexed column. > > The query does contain an ORDER BY, so if the index is not chosen to > provide pre-sorted input, then something has to put the results in the > correct order before the LIMIT is applied. The last query I tried was: SELECT DISTINCT workitem_n FROM test_runs_raw LIMIT 10; See my message at [1] https://www.postgresql.org/message-id/69077f15-4125-2d63-733f-21ce6eac4f01%40gmx.net Will re-check things and report back with further debugging info you asked for later today. Dimitris
Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions
From
Dimitrios Apostolou
Date:
On Tue, 14 May 2024, David Rowley wrote: > That assumes the Append won't ever use > 1 worker per subnode, but > that's not the case for your plan as the subnodes are "Parallel". > That means all the workers could be working on the same subnode which > could result in one group being split between 2 or more workers. Didn't think of that, makes sense! > Parallel Append can also run in a way that the Append child nodes will > only get 1 worker each. How can I tell which case it is, from the EXPLAIN output (for example the output at [1]) ? [1] https://www.postgresql.org/message-id/69077f15-4125-2d63-733f-21ce6eac4f01%40gmx.net Dimitris
Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions
From
David Rowley
Date:
On Tue, 14 May 2024 at 01:52, Dimitrios Apostolou <jimis@gmx.net> wrote: > > On Tue, 14 May 2024, David Rowley wrote: > > The query does contain an ORDER BY, so if the index is not chosen to > > provide pre-sorted input, then something has to put the results in the > > correct order before the LIMIT is applied. > > The last query I tried was: > > SELECT DISTINCT workitem_n FROM test_runs_raw LIMIT 10; I was looking at the original query. In that case, we have 2 ways to remove duplicate rows with DISTINCT, "Hash Aggregate" and "Sort" -> "Unique". Both of these will consume all of their input rows before outputting any rows. DISTINCT with LIMIT is a special case that we don't have a good operator for. In theory, we could have some "Hash Distinct" node type that was less eager to consume all of its input rows. When invoked "Hash Distinct" could consume input rows until it found one that didn't exist in the hash table. I've no idea how that would work when we exceed work_mem. However, most queries with a LIMIT will have an ORDER BY, so such a node likely wouldn't get much use. David
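As a side note, the early-exit behaviour that such a node would provide can be emulated today with the well-known "loose index scan" recursive-CTE trick (described on the PostgreSQL wiki), adapted here to the column names from this thread. Each step does one max() index lookup below the previous value, so the query stops as soon as the LIMIT is satisfied:

```sql
-- Emulate a loose index scan: fetch the 10 largest distinct
-- workitem_n values via repeated max() probes of the btree index.
WITH RECURSIVE distinct_vals AS (
    SELECT max(workitem_n) AS w FROM test_runs_raw
    UNION ALL
    SELECT (SELECT max(workitem_n)
            FROM test_runs_raw
            WHERE workitem_n < dv.w)
    FROM distinct_vals dv
    WHERE dv.w IS NOT NULL
)
SELECT w FROM distinct_vals WHERE w IS NOT NULL LIMIT 10;
```

Because recursive CTEs are evaluated lazily, the LIMIT stops the recursion after ten probes instead of scanning the whole table.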
Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions
From
David Rowley
Date:
On Tue, 14 May 2024 at 02:07, Dimitrios Apostolou <jimis@gmx.net> wrote:
>
> On Tue, 14 May 2024, David Rowley wrote:
>> Parallel Append can also run in a way that the Append child nodes will
>> only get 1 worker each.
>
> How can I tell which case it is, from the EXPLAIN output (for example
> the output at [1]) ?

IIRC, the planner does prefer to use parallel-aware child Paths when creating a Parallel Append. Given equivalent costs, there's no advantage to it choosing a non-parallel-aware Path; the planner does not have any optimisations that this would enable. However, it is possible that the planner *could* generate these. All the Append subpaths would just have to be parallel safe but not parallel aware.

You could identify them in EXPLAIN by seeing a "Parallel Append" without the "Parallel" in front of the node names in any of the Parallel Append's subpaths.

David
Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions
From
Dimitrios Apostolou
Date:
On Fri, 10 May 2024, Tom Lane wrote:

> Dimitrios Apostolou <jimis@gmx.net> writes:
>> Further digging into this simple query, if I force the non-parallel plan
>> by setting max_parallel_workers_per_gather TO 0, I see that the query
>> planner comes up with a cost much higher:
>
>> Limit (cost=363.84..1134528847.47 rows=10 width=4)
>>   -> Unique (cost=363.84..22690570036.41 rows=200 width=4)
>>     -> Append (cost=363.84..22527480551.58 rows=65235793929 width=4)
>> ...
>
>> The total cost on the 1st line (cost=363.84..1134528847.47) has a much
>> higher upper limit than the total cost when
>> max_parallel_workers_per_gather is 4 (cost=853891608.79..853891608.99).
>> This explains the planner's choice. But I wonder why the cost estimation
>> is so far away from reality.
>
> I'd say the blame lies with that (probably-default) estimate of
> just 200 distinct rows. That means the planner expects to have
> to read about 5% (10/200) of the tables to get the result, and
> that's making fast-start plans look bad.
>
> Possibly an explicit ANALYZE on the partitioned table would help.

It took long but it finished:

ANALYZE
Time: 19177398.025 ms (05:19:37.398)

And it made a difference indeed, the serial plan is chosen now:

EXPLAIN SELECT DISTINCT workitem_n FROM test_runs_raw ORDER BY workitem_n DESC LIMIT 10;

 Limit  (cost=364.29..1835512.29 rows=10 width=4)
   ->  Unique  (cost=364.29..22701882164.56 rows=123706 width=4)
         ->  Append  (cost=364.29..22538472401.60 rows=65363905182 width=4)
               ->  Index Only Scan Backward using test_runs_raw__part_max20000k_pkey on test_runs_raw__part_max20000k test_runs_raw_1000  (cost=0.12..2.34 rows=1 width=4)
               ->  Index Only Scan Backward using test_runs_raw__part_max19980k_pkey on test_runs_raw__part_max19980k test_runs_raw_999  (cost=0.12..2.34 rows=1 width=4)
               ->  Index Only Scan Backward using test_runs_raw__part_max19960k_pkey on test_runs_raw__part_max19960k test_runs_raw_998  (cost=0.12..2.34 rows=1 width=4)
               [...]
               ->  Index Only Scan Backward using test_runs_raw__part_max12460k_pkey on test_runs_raw__part_max12460k test_runs_raw_623  (cost=0.57..12329614.53 rows=121368496 width=4)
               ->  Index Only Scan Backward using test_runs_raw__part_max12440k_pkey on test_runs_raw__part_max12440k test_runs_raw_622  (cost=0.57..5180832.16 rows=184927264 width=4)
               ->  Index Only Scan Backward using test_runs_raw__part_max12420k_pkey on test_runs_raw__part_max12420k test_runs_raw_621  (cost=0.57..4544964.21 rows=82018824 width=4)
               [...]

Overall I think there are two issues that postgres could handle better here:

1. Avoid the need for a manual ANALYZE on the partitioned table.
2. Create a different parallel plan, one that can exit early, when the LIMIT is proportionally low. I feel the partitions could be parallel-scanned in order, so that the whole thing stops when one partition has been read.

Thank you!
Dimitris
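On point 1, the periodic re-run can at least be made much cheaper: ANALYZE accepts a column list, so for this query only the partitioning column's statistics need to be refreshed. A sketch:

```sql
-- Refresh statistics only for the workitem_n column instead of
-- sampling every column of every partition.
ANALYZE test_runs_raw (workitem_n);
```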
Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions
From
Dimitrios Apostolou
Date:
On Tue, 14 May 2024, Dimitrios Apostolou wrote:

> It took long but it finished:
>
> ANALYZE
> Time: 19177398.025 ms (05:19:37.398)

I see now that default_statistics_target is globally set to 500, so this is probably the reason it took so long. I guess with the default of 100 it would take approximately one hour. That is much better suited to a cron job. :-)

Dimitris
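If the 500 target is only needed for other columns, it can also be overridden per column so that re-analyzing this table stays cheap (a sketch; 100 is the stock default_statistics_target):

```sql
-- Use a smaller statistics sample for workitem_n than the global
-- default_statistics_target = 500, then refresh just that column.
ALTER TABLE test_runs_raw ALTER COLUMN workitem_n SET STATISTICS 100;
ANALYZE test_runs_raw (workitem_n);
```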
Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions
From
Dimitrios Apostolou
Date:
On Tue, 14 May 2024, David Rowley wrote:

> If you were to put the n_distinct_inherited estimate back to 200 and
> disable sort, you should see the costs are higher for the index plan.
> If that's not the case then there might be a bug. It seems more
> likely that due to the n_distinct estimate being so low that the
> planner thought that a large enough fraction of the rows needed to be
> read and that made the non-index plan appear cheaper.
>
> I'd be interested in seeing what the costs are for the index plan. I
> think the following will give you that (untested):
>
> alter table test_runs_raw alter column workitem_n set (n_distinct_inherited=200);
> analyze test_runs_raw;

I had to stop this step because it was taking too long going through all partitions again. But it seems it had the desired effect.

> set enable_sort=0;
> explain SELECT DISTINCT workitem_n FROM test_runs_raw ORDER BY workitem_n DESC LIMIT 10;

It chooses the non-parallel index plan:

 Limit  (cost=365.17..1135517462.36 rows=10 width=4)
   ->  Unique  (cost=365.17..22710342308.83 rows=200 width=4)
         ->  Append  (cost=365.17..22546660655.46 rows=65472661350 width=4)
               ->  Index Only Scan Backward using test_runs_raw__part_max20000k_pkey on test_runs_raw__part_max20000k test_runs_raw_1000  (cost=0.12..2.34 rows=1 width=4)
               ->  Index Only Scan Backward using test_runs_raw__part_max19980k_pkey on test_runs_raw__part_max19980k test_runs_raw_999  (cost=0.12..2.34 rows=1 width=4)
               [... only index scans follow]

LIMIT 100 goes for the parallel seqscan plan, that even contains a sort! But it seems to me that the extra upper-level HashAggregate step raises the cost by an order of magnitude, from 800M to 10G, in comparison to doing (Unique -> Sort) - see the plan in the next paragraph.

 Limit  (cost=10857220388.76..10857220389.01 rows=100 width=4)
   ->  Sort  (cost=10857220388.76..10857220389.26 rows=200 width=4)
         Sort Key: test_runs_raw.workitem_n DESC
         ->  HashAggregate  (cost=857220379.12..857220381.12 rows=200 width=4)
               Group Key: test_runs_raw.workitem_n
               ->  Gather  (cost=857220295.12..857220377.12 rows=800 width=4)
                     Workers Planned: 4
                     ->  HashAggregate  (cost=857219295.12..857219297.12 rows=200 width=4)
                           Group Key: test_runs_raw.workitem_n
                           ->  Parallel Append  (cost=0.00..816295259.21 rows=16369614363 width=4)
                                 ->  Parallel Index Only Scan Backward using test_runs_raw__part_max9600k_pkey on test_runs_raw__part_max9600k test_runs_raw_480  (cost=0.57..1597356.30 rows=33623360 width=4)
                                 ->  Parallel Index Only Scan Backward using test_runs_raw__part_max10140k_pkey on test_runs_raw__part_max10140k test_runs_raw_507  (cost=0.57..1210806.37 rows=25794030 width=4)
                                 ->  Parallel Seq Scan on test_runs_raw__part_max9500k test_runs_raw_475  (cost=0.00..3037800.88 rows=64122388 width=4)
                                 ->  Parallel Seq Scan on test_runs_raw__part_max11180k test_runs_raw_559  (cost=0.00..2918865.36 rows=61611136 width=4)
                                 [... only seqscans follow]

If I re-enable sort, then it goes for the parallel seqscan plan even with LIMIT 10:

SET SESSION enable_sort TO TRUE;
EXPLAIN SELECT DISTINCT workitem_n FROM test_runs_raw ORDER BY workitem_n DESC LIMIT 10;

 Limit  (cost=857166256.39..857166256.59 rows=10 width=4)
   ->  Unique  (cost=857166256.39..857166260.39 rows=200 width=4)
         ->  Sort  (cost=857166256.39..857166258.39 rows=800 width=4)
               Sort Key: test_runs_raw.workitem_n DESC
               ->  Gather  (cost=857166135.82..857166217.82 rows=800 width=4)
                     Workers Planned: 4
                     ->  HashAggregate  (cost=857165135.82..857165137.82 rows=200 width=4)
                           Group Key: test_runs_raw.workitem_n
                           ->  Parallel Append  (cost=0.00..816243567.24 rows=16368627432 width=4)
                                 ->  Parallel Index Only Scan Backward using test_runs_raw__part_max9600k_pkey on test_runs_raw__part_max9600k test_runs_raw_480  (cost=0.57..1597356.30 rows=33623360 width=4)
                                 ->  Parallel Index Only Scan Backward using test_runs_raw__part_max10140k_pkey on test_runs_raw__part_max10140k test_runs_raw_507  (cost=0.57..1210806.37 rows=25794030 width=4)
                                 ->  Parallel Seq Scan on test_runs_raw__part_max9500k test_runs_raw_475  (cost=0.00..3037800.88 rows=64122388 width=4)
                                 ->  Parallel Seq Scan on test_runs_raw__part_max11180k test_runs_raw_559  (cost=0.00..2918865.36 rows=61611136 width=4)
                                 [... only seqscans follow]

So in order of lower to higher cost, we have the following alternatives:

1. non-parallel index scan (800M)
2. parallel seqscan with sort (1.3G)
3. parallel seqscan without sort but actually has a sort (10G assuming it's the same as for LIMIT 100)

> -- undo
> alter table test_runs_raw alter column workitem_n set (n_distinct_inherited=-1);

I believe I need to set it to 0 to be back to defaults.

> reset enable_sort;

Regards,
Dimitris
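For what it's worth, the attribute option can also be removed outright with RESET rather than guessing the right sentinel value, and a fresh ANALYZE then restores the computed estimate. A sketch:

```sql
-- Remove the n_distinct_inherited override entirely, falling back
-- to whatever ANALYZE computes for the partitioned table.
ALTER TABLE test_runs_raw ALTER COLUMN workitem_n
    RESET (n_distinct_inherited);
ANALYZE test_runs_raw (workitem_n);
```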
Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions
From
Dimitrios Apostolou
Date:
I have forgotten to mention that I have enable_partitionwise_aggregate=on in the global settings since the beginning. According to the docs: > Enables or disables the query planner's use of partitionwise grouping or > aggregation, which allows grouping or aggregation on partitioned tables > to be performed separately for each partition. Reading that, I'd expect to see a separate DISTINCT->LIMIT 10 on every partition, and then it would be up to independent plans to decide whether each partition follows a parallel or a serial plan. Not sure if this plan was checked but rejected because of cost. Dimitris