Thread: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions

Hello list,

INTRO

I have a huge (multi-billion row) table partitioned into 1000 partitions.
Around half of the partitions are full and the rest are empty, created in
advance, ready to receive future incoming data. Postgres version is 16.2.
Here are the relevant parts of the schema:

> \d test_runs_raw
                      Partitioned table "public.test_runs_raw"
       Column       |            Type             | Collation | Nullable |             Default
-------------------+-----------------------------+-----------+----------+----------------------------------
 run_n             | bigint                      |           | not null | generated by default as identity
 test_case_n       | smallint                    |           | not null |
 workitem_n        | integer                     |           | not null |
 test_resulttype_n | smallint                    |           |          |
Partition key: RANGE (workitem_n)
Indexes:
    "test_runs_raw_partitioned_pkey" PRIMARY KEY, btree (workitem_n, run_n)

Each partition is made to keep entries with workitem_n in ranges
(0,20k), (20k,40k) and so on (k = kilo) up to 20000k.
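
For illustration, the first two partitions would have been created with
DDL along these lines (a sketch, not the exact commands from my schema):

CREATE TABLE test_runs_raw__part_max20k
    PARTITION OF test_runs_raw
    FOR VALUES FROM (0) TO (20000);

CREATE TABLE test_runs_raw__part_max40k
    PARTITION OF test_runs_raw
    FOR VALUES FROM (20000) TO (40000);

-- ... and so on, up to test_runs_raw__part_max20000k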


PROBLEM

I noticed that the following query is very very slow (too long to wait for
it to finish):

SELECT DISTINCT workitem_n FROM test_runs_raw ORDER BY workitem_n DESC LIMIT 10;

What is remarkable is that, in 998 out of the 1000 table scans involved,
the planner does not use the index. Instead it chooses a sequential scan.
Here is the output from EXPLAIN:

  Limit  (cost=853891608.79..853891608.99 rows=10 width=4)
    ->  Unique  (cost=853891608.79..853891612.79 rows=200 width=4)
          ->  Sort  (cost=853891608.79..853891610.79 rows=800 width=4)
                Sort Key: test_runs_raw.workitem_n DESC
                ->  Gather  (cost=853891488.22..853891570.22 rows=800 width=4)
                      Workers Planned: 4
                      ->  HashAggregate  (cost=853890488.22..853890490.22 rows=200 width=4)
                            Group Key: test_runs_raw.workitem_n
                            ->  Parallel Append  (cost=0.00..813118117.30 rows=16308948365 width=4)
                                  ->  Parallel Index Only Scan Backward using test_runs_raw__part_max9600k_pkey on
test_runs_raw__part_max9600k test_runs_raw_480  (cost=0.57..1597355.10 rows=33623320 width=4)
                                  ->  Parallel Index Only Scan Backward using test_runs_raw__part_max10140k_pkey on
test_runs_raw__part_max10140k test_runs_raw_507  (cost=0.57..1210795.63 rows=25793672 width=4)
                                  ->  Parallel Seq Scan on test_runs_raw__part_max9500k test_runs_raw_475
(cost=0.00..3037793.12 rows=64121612 width=4)
                                  ->  Parallel Seq Scan on test_runs_raw__part_max11180k test_runs_raw_559
(cost=0.00..2918875.90 rows=61612190 width=4)
[ ... 996 more sequential scans ... ]

If I remove DISTINCT then the plan changes dramatically and it runs
instantaneously:

  Limit  (cost=363.84..367.30 rows=10 width=4)
    ->  Append  (cost=363.84..22527480551.58 rows=65235793929 width=4)
          ->  Index Only Scan Backward using test_runs_raw__part_max20000k_pkey on test_runs_raw__part_max20000k
test_runs_raw_1000 (cost=0.12..2.34 rows=1 width=4) 
          ->  Index Only Scan Backward using test_runs_raw__part_max19980k_pkey on test_runs_raw__part_max19980k
test_runs_raw_999 (cost=0.12..2.34 rows=1 width=4) 
          ->  Index Only Scan Backward using test_runs_raw__part_max19960k_pkey on test_runs_raw__part_max19960k
test_runs_raw_998 (cost=0.12..2.34 rows=1 width=4) 
          ->  Index Only Scan Backward using test_runs_raw__part_max19940k_pkey on test_runs_raw__part_max19940k
test_runs_raw_997 (cost=0.12..2.34 rows=1 width=4) 
[ ... 996 more index scans ... ]

Notice how in the last plan there is no parallel scanning. Instead the
partitions are scanned sequentially, *in proper order*,
so that the plan execution stops after reading the first
10 rows in the first non-empty partition.

Why can't the same be done with DISTINCT?
Please note that the workitem_n values are well spread within the range
(0, 13M), and the table has been filled gradually over the course of one
year, so I'm assuming the autovacuum worker has run long enough to build
sane statistics (I'm not sure how to verify that, though).
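
Maybe something like the following would show whether the planner has an
n_distinct estimate for the column at all (untested sketch):

-- n_distinct for the table's statistics, inherited and not, if any exist yet
SELECT inherited, n_distinct
FROM pg_stats
WHERE tablename = 'test_runs_raw' AND attname = 'workitem_n';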


REMARKS

1. I tried reproducing the problem on an artificial table with few
    partitions and few values, but I couldn't. Both queries execute fast,
    and the planner always chooses a non-parallel index-only scan.

2. While testing changes to various settings, I noticed that setting
    max_parallel_workers_per_gather to 0 (from the original value of 4)
    fixes the issue! On the original huge table, disabling parallelism
    makes the query dramatically faster: it returns within 1s! Is this a
    bug in the planner? (Exact commands below.)
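
For anyone who wants to reproduce the experiment, it boils down to a
single session-level setting:

SET max_parallel_workers_per_gather TO 0;
SELECT DISTINCT workitem_n FROM test_runs_raw ORDER BY workitem_n DESC LIMIT 10;
RESET max_parallel_workers_per_gather;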


Thank you,
Dimitris




On Fri, 10 May 2024, Dimitrios Apostolou wrote:

> I noticed that the following query is very very slow (too long to wait for it
> to finish):
>
> SELECT DISTINCT workitem_n FROM test_runs_raw ORDER BY workitem_n DESC LIMIT 10;

Update: even the simplest SELECT DISTINCT query shows similar behaviour:

EXPLAIN
SELECT DISTINCT workitem_n FROM test_runs_raw LIMIT 10;

  Limit  (cost=724518979.52..724518979.92 rows=10 width=4)
    ->  Unique  (cost=724518979.52..724518987.52 rows=200 width=4)
          ->  Sort  (cost=724518979.52..724518983.52 rows=1600 width=4)
                Sort Key: test_runs_raw.workitem_n
                ->  Gather  (cost=724518732.37..724518894.37 rows=1600 width=4)
                      Workers Planned: 4
                      ->  HashAggregate  (cost=724517732.37..724517734.37 rows=200 width=4)
                            Group Key: test_runs_raw.workitem_n
                            ->  Parallel Append  (cost=0.00..704131546.90 rows=8154474186 width=4)
                                  ->  Parallel Index Only Scan using test_runs_raw__part_max9600k_pkey on
test_runs_raw__part_max9600k test_runs_raw_480  (cost=0.57..1429238.50 rows=16811660 width=4)
                                  ->  Parallel Index Only Scan using test_runs_raw__part_max10140k_pkey on
test_runs_raw__part_max10140k test_runs_raw_507  (cost=0.57..1081827.27 rows=12896836 width=4)
                                  ->  Parallel Seq Scan on test_runs_raw__part_max9500k test_runs_raw_475
(cost=0.00..2717185.06 rows=32060806 width=4)
                                  ->  Parallel Seq Scan on test_runs_raw__part_max11180k test_runs_raw_559
(cost=0.00..2610814.95 rows=30806095 width=4)


It also takes ages to return, so I have to interrupt it.

I believe it should exit early, as soon as it finds 10 distinct values
(which should be rather easy even with parallel seqscans, given the
pattern followed when inserting the data).


Thanks,
Dimitris




On Fri, 10 May 2024, Dimitrios Apostolou wrote:

> On Fri, 10 May 2024, Dimitrios Apostolou wrote:
>
> Update: even the simplest SELECT DISTINCT query shows similar behaviour:

Further digging into this simple query, if I force the non-parallel plan
by setting max_parallel_workers_per_gather TO 0, I see that the query
planner comes up with a cost much higher:

  Limit  (cost=363.84..1134528847.47 rows=10 width=4)
    ->  Unique  (cost=363.84..22690570036.41 rows=200 width=4)
          ->  Append  (cost=363.84..22527480551.58 rows=65235793929 width=4)
                ->  Index Only Scan using test_runs_raw__part_max20k_pkey on test_runs_raw__part_max20k test_runs_raw_1
(cost=0.12..2.34 rows=1 width=4) 
                ->  Index Only Scan using test_runs_raw__part_max40k_pkey on test_runs_raw__part_max40k test_runs_raw_2
(cost=0.12..2.34 rows=1 width=4) 
[...]
                ->  Index Only Scan using test_runs_raw__part_max1780k_pkey on test_runs_raw__part_max1780k
test_runs_raw_89 (cost=0.57..53587294.65 rows=106088160 width=4) 
                ->  Index Only Scan using test_runs_raw__part_max1800k_pkey on test_runs_raw__part_max1800k
test_runs_raw_90 (cost=0.57..98943539.74 rows=96214080 width=4) 
                ->  Index Only Scan using test_runs_raw__part_max1820k_pkey on test_runs_raw__part_max1820k
test_runs_raw_91 (cost=0.57..97495653.34 rows=193248960 width=4) 
                ->  Index Only Scan using test_runs_raw__part_max1840k_pkey on test_runs_raw__part_max1840k
test_runs_raw_92 (cost=0.57..110205205.07 rows=218440928 width=4) 
                ->  Index Only Scan using test_runs_raw__part_max1860k_pkey on test_runs_raw__part_max1860k
test_runs_raw_93 (cost=0.57..50164056.28 rows=99431760 width=4) 
[...]


The total cost on the 1st line (cost=363.84..1134528847.47) has a much
higher upper limit than the total cost when
max_parallel_workers_per_gather is 4 (cost=853891608.79..853891608.99).

This explains the planner's choice. But I wonder why the cost estimation
is so far away from reality.


Dimitris




Dimitrios Apostolou <jimis@gmx.net> writes:
> Further digging into this simple query, if I force the non-parallel plan
> by setting max_parallel_workers_per_gather TO 0, I see that the query
> planner comes up with a cost much higher:

>   Limit  (cost=363.84..1134528847.47 rows=10 width=4)
>     ->  Unique  (cost=363.84..22690570036.41 rows=200 width=4)
>           ->  Append  (cost=363.84..22527480551.58 rows=65235793929 width=4)
> ...

> The total cost on the 1st line (cost=363.84..1134528847.47) has a much
> higher upper limit than the total cost when
> max_parallel_workers_per_gather is 4 (cost=853891608.79..853891608.99).
> This explains the planner's choice. But I wonder why the cost estimation
> is so far away from reality.

I'd say the blame lies with that (probably-default) estimate of
just 200 distinct rows.  That means the planner expects to have
to read about 5% (10/200) of the tables to get the result, and
that's making fast-start plans look bad.

Possibly an explicit ANALYZE on the partitioned table would help.

            regards, tom lane



On Fri, 10 May 2024, Tom Lane wrote:

> Dimitrios Apostolou <jimis@gmx.net> writes:
>> Further digging into this simple query, if I force the non-parallel plan
>> by setting max_parallel_workers_per_gather TO 0, I see that the query
>> planner comes up with a cost much higher:
>
>>   Limit  (cost=363.84..1134528847.47 rows=10 width=4)
>>     ->  Unique  (cost=363.84..22690570036.41 rows=200 width=4)
>>           ->  Append  (cost=363.84..22527480551.58 rows=65235793929 width=4)
>> ...
>
>> The total cost on the 1st line (cost=363.84..1134528847.47) has a much
>> higher upper limit than the total cost when
>> max_parallel_workers_per_gather is 4 (cost=853891608.79..853891608.99).
>> This explains the planner's choice. But I wonder why the cost estimation
>> is so far away from reality.
>
> I'd say the blame lies with that (probably-default) estimate of
> just 200 distinct rows.  That means the planner expects to have
> to read about 5% (10/200) of the tables to get the result, and
> that's making fast-start plans look bad.

Indeed that's an awful estimate; the table has more than 1M unique
values in that column. Looking into pg_stat_user_tables, I can't see that
the partitions have been vacuumed or analyzed at all. I think they should
have been auto-analyzed, since they get a ton of INSERTs
(no deletes/updates though) and I have the default autovacuum settings.
Could it be that autovacuum starts, but never
finishes? I can't find anything in the logs.
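
For reference, the kind of query I was running against the statistics
views (roughly):

SELECT relname, last_vacuum, last_autovacuum, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname LIKE 'test_runs_raw%'
ORDER BY relname;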

In any case, even after the planner decides to execute the terrible plan
with the parallel seqscans, why doesn't it finish right when it finds 10
distinct values?

>
> Possibly an explicit ANALYZE on the partitioned table would help.

Thanks, I'll save the ANALYZE as the last step; I feel it's a good
opportunity to figure out more details about how postgres works. Plus I
expect ANALYZE to last a couple of days, so I should first find quiet time
for that. :-)

Dimitris



Dimitrios Apostolou <jimis@gmx.net> writes:
> On Fri, 10 May 2024, Tom Lane wrote:
>> I'd say the blame lies with that (probably-default) estimate of
>> just 200 distinct rows.  That means the planner expects to have
>> to read about 5% (10/200) of the tables to get the result, and
>> that's making fast-start plans look bad.

> In any case, even after the planner decides to execute the terrible plan
> with the parallel seqscans, why doesn't it finish right when it finds 10
> distinct values?

That plan can't emit anything at all till it finishes the Sort.

I do kind of wonder why it's producing both a hashagg and a Unique
step --- seems like it should do one or the other.

> Thanks, I'll save the ANALYZE as the last step; I feel it's a good
> opportunity to figure out more details about how postgres works. Plus I
> expect ANALYZE to last a couple of days, so I should first find quiet time
> for that. :-)

It really should not take too long --- it reads a sample, not the
whole table.

            regards, tom lane



On Sat, 11 May 2024 at 13:11, Dimitrios Apostolou <jimis@gmx.net> wrote:
> Indeed that's an awful estimate; the table has more than 1M unique
> values in that column. Looking into pg_stat_user_tables, I can't see that
> the partitions have been vacuumed or analyzed at all. I think they should
> have been auto-analyzed, since they get a ton of INSERTs
> (no deletes/updates though) and I have the default autovacuum settings.
> Could it be that autovacuum starts, but never
> finishes? I can't find anything in the logs.

It's not the partitions getting analyzed you need to worry about for
an ndistinct estimate on the partitioned table. It's auto-analyze or
ANALYZE on the partitioned table itself that you should care about.

If you look at [1], it says "Tuples changed in partitions and
inheritance children do not trigger analyze on the parent table."

> In any case, even after the planner decides to execute the terrible plan
> with the parallel seqscans, why doesn't it finish right when it finds 10
> distinct values?

It will. It's just that Sorting requires fetching everything from its subnode.

David

[1] https://www.postgresql.org/docs/16/routine-vacuuming.html#VACUUM-FOR-STATISTICS



On Sat, 11 May 2024 at 13:33, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I do kind of wonder why it's producing both a hashagg and a Unique
> step --- seems like it should do one or the other.

It still needs to make the duplicate groups from parallel workers unique.

David



On Sat, 11 May 2024, David Rowley wrote:

> On Sat, 11 May 2024 at 13:33, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I do kind of wonder why it's producing both a hashagg and a Unique
>> step --- seems like it should do one or the other.
>
> It still needs to make the duplicate groups from parallel workers unique.

Range partitioning of the table guarantees that, since the ranges are not
overlapping.


Dimitris




On Sat, 11 May 2024, David Rowley wrote:

> On Sat, 11 May 2024 at 13:11, Dimitrios Apostolou <jimis@gmx.net> wrote:
>> Indeed that's an awful estimate; the table has more than 1M unique
>> values in that column. Looking into pg_stat_user_tables, I can't see that
>> the partitions have been vacuumed or analyzed at all. I think they should
>> have been auto-analyzed, since they get a ton of INSERTs
>> (no deletes/updates though) and I have the default autovacuum settings.
>> Could it be that autovacuum starts, but never
>> finishes? I can't find anything in the logs.
>
> It's not the partitions getting analyzed you need to worry about for
> an ndistinct estimate on the partitioned table. It's auto-analyze or
> ANALYZE on the partitioned table itself that you should care about.
>
> If you look at [1], it says "Tuples changed in partitions and
> inheritance children do not trigger analyze on the parent table."

Thanks

>
>> In any case, even after the planner decides to execute the terrible plan
>> with the parallel seqscans, why doesn't it finish right when it finds 10
>> distinct values?
>
> It will. It's just that Sorting requires fetching everything from its subnode.

Isn't it plain wrong to have a sort step in the plan then? The different
partitions contain different value ranges with no overlap, and the last
query I posted doesn't even contain an ORDER BY clause, just a DISTINCT
clause on an indexed column.

Even with bad estimates, even with seq scan instead of index scan, the
plan should be such that it concludes all parallel work as soon as it
finds the 10 distinct values. And this is actually achieved if I disable
parallel plans. Could it be a bug in the parallel plan generation?


Dimitris




On Mon, 13 May 2024, Dimitrios Apostolou wrote:

> On Sat, 11 May 2024, David Rowley wrote:
>
>>  On Sat, 11 May 2024 at 13:11, Dimitrios Apostolou <jimis@gmx.net> wrote:
>>>  Indeed that's an awful estimate; the table has more than 1M unique
>>>  values in that column. Looking into pg_stat_user_tables, I can't see that
>>>  the partitions have been vacuumed or analyzed at all. I think they should
>>>  have been auto-analyzed, since they get a ton of INSERTs
>>>  (no deletes/updates though) and I have the default autovacuum settings.
>>>  Could it be that autovacuum starts, but never
>>>  finishes? I can't find anything in the logs.
>>
>>  It's not the partitions getting analyzed you need to worry about for
>>  an ndistinct estimate on the partitioned table. It's auto-analyze or
>>  ANALYZE on the partitioned table itself that you should care about.
>>
>>  If you look at [1], it says "Tuples changed in partitions and
>>  inheritance children do not trigger analyze on the parent table."
>
> Thanks

Do I read that correctly, that I have to set up cron jobs to manually
analyze partitioned tables?


Dimitris




On Tue, 14 May 2024 at 00:28, Dimitrios Apostolou <jimis@gmx.net> wrote:
>
> On Sat, 11 May 2024, David Rowley wrote:
>
> > On Sat, 11 May 2024 at 13:33, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> I do kind of wonder why it's producing both a hashagg and a Unique
> >> step --- seems like it should do one or the other.
> >
> > It still needs to make the duplicate groups from parallel workers unique.
>
> Range partitioning of the table guarantees that, since the ranges are not
> overlapping.

That assumes the Append won't ever use > 1 worker per subnode, but
that's not the case for your plan as the subnodes are "Parallel".
That means all the workers could be working on the same subnode which
could result in one group being split between 2 or more workers.

Parallel Append can also run in a way that the Append child nodes will
only get 1 worker each.  However, even if that were the case for your
plan, we have no code that would skip the final aggregate phase when
the DISTINCT / GROUP BY clause contains all of the partition key columns.

David



On Tue, 14 May 2024 at 00:41, Dimitrios Apostolou <jimis@gmx.net> wrote:
>
> On Sat, 11 May 2024, David Rowley wrote:
> > It will. It's just that Sorting requires fetching everything from its subnode.
>
> Isn't it plain wrong to have a sort step in the plan then? The different
> partitions contain different value ranges with no overlap, and the last
> query I posted doesn't even contain an ORDER BY clause, just a DISTINCT
> clause on an indexed column.

The query does contain an ORDER BY, so if the index is not chosen to
provide pre-sorted input, then something has to put the results in the
correct order before the LIMIT is applied.

> Even with bad estimates, even with seq scan instead of index scan, the
> plan should be such that it concludes all parallel work as soon as it
> finds the 10 distinct values. And this is actually achieved if I disable
> parallel plans. Could it be a bug in the parallel plan generation?

If you were to put the n_distinct_inherited estimate back to 200 and
disable sort, you should see the costs are higher for the index plan.
If that's not the case then there might be a bug.  It seems more
likely that, due to the n_distinct estimate being so low, the planner
thought that a large enough fraction of the rows needed to be read, and
that made the non-index plan appear cheaper.

I'd be interested in seeing what the costs are for the index plan. I
think the following will give you that (untested):

alter table test_runs_raw alter column workitem_n set
(n_distinct_inherited=200);
analyze test_runs_raw;
set enable_sort=0;
explain SELECT DISTINCT workitem_n FROM test_runs_raw ORDER BY
workitem_n DESC LIMIT 10;

-- undo
alter table test_runs_raw alter column workitem_n set (n_distinct_inherited=-1);
reset enable_sort;

David



On Tue, 14 May 2024 at 00:46, Dimitrios Apostolou <jimis@gmx.net> wrote:
>
> On Mon, 13 May 2024, Dimitrios Apostolou wrote:
>
> > On Sat, 11 May 2024, David Rowley wrote:
> >>  If you look at [1], it says "Tuples changed in partitions and
> >>  inheritance children do not trigger analyze on the parent table."
> >

> Do I read that correctly, that I have to set up cron jobs to manually
> analyze partitioned tables?

It means that auto-analyze won't touch it.  Periodically doing an
ANALYZE on the partitioned table is probably a good idea.
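
Something like the following crontab entry would do it (database name
and schedule are just placeholders, of course):

# weekly ANALYZE of the partitioned table, to refresh inheritance stats
0 3 * * 0  psql -d your_db -c 'ANALYZE public.test_runs_raw;'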

David



On Tue, 14 May 2024, David Rowley wrote:

> On Tue, 14 May 2024 at 00:41, Dimitrios Apostolou <jimis@gmx.net> wrote:
>>
>> On Sat, 11 May 2024, David Rowley wrote:
>>> It will. It's just that Sorting requires fetching everything from its subnode.
>>
>> Isn't it plain wrong to have a sort step in the plan then? The different
>> partitions contain different value ranges with no overlap, and the last
>> query I posted doesn't even contain an ORDER BY clause, just a DISTINCT
>> clause on an indexed column.
>
> The query does contain an ORDER BY, so if the index is not chosen to
> provide pre-sorted input, then something has to put the results in the
> correct order before the LIMIT is applied.

The last query I tried was:

SELECT DISTINCT workitem_n FROM test_runs_raw LIMIT 10;

See my message at

[1] https://www.postgresql.org/message-id/69077f15-4125-2d63-733f-21ce6eac4f01%40gmx.net

Will re-check things and report back later today with the further
debugging info you asked for.


Dimitris





On Tue, 14 May 2024, David Rowley wrote:

> That assumes the Append won't ever use > 1 worker per subnode, but
> that's not the case for your plan as the subnodes are "Parallel".
> That means all the workers could be working on the same subnode which
> could result in one group being split between 2 or more workers.

Didn't think of that, makes sense!

> Parallel Append can also run in a way that the Append child nodes will
> only get 1 worker each.

How can I tell which case it is from the EXPLAIN output (for example
the output at [1])?

[1] https://www.postgresql.org/message-id/69077f15-4125-2d63-733f-21ce6eac4f01%40gmx.net

Dimitris




On Tue, 14 May 2024 at 01:52, Dimitrios Apostolou <jimis@gmx.net> wrote:
>
> On Tue, 14 May 2024, David Rowley wrote:
> > The query does contain an ORDER BY, so if the index is not chosen to
> > provide pre-sorted input, then something has to put the results in the
> > correct order before the LIMIT is applied.
>
> The last query I tried was:
>
> SELECT DISTINCT workitem_n FROM test_runs_raw LIMIT 10;

I was looking at the original query.   In that case, we have 2 ways to
remove duplicate rows with DISTINCT, "Hash Aggregate" and "Sort" ->
"Unique". Both of these will consume all of their input rows before
outputting any rows.

DISTINCT with LIMIT is a special case that we don't have a good
operator for.  In theory, we could have some "Hash Distinct" node type
that was less eager to consume all of its input rows.  When invoked
"Hash Distinct" could consume input rows until it found one that
didn't exist in the hash table.  I've no idea how that would work when
we exceed work_mem.  However, most queries with a LIMIT will have an
ORDER BY, so such a node likely wouldn't get much use.
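
In the meantime, a common workaround for DISTINCT-with-LIMIT on an
indexed column is to emulate a skip scan with a recursive CTE, along
these lines (untested):

WITH RECURSIVE t AS (
    -- first row: one cheap descent of the (workitem_n, run_n) index
    SELECT max(workitem_n) AS w FROM test_runs_raw
  UNION ALL
    -- each step finds the next distinct value below the previous one
    SELECT (SELECT max(workitem_n) FROM test_runs_raw
            WHERE workitem_n < t.w)
    FROM t
    WHERE t.w IS NOT NULL
)
SELECT w AS workitem_n FROM t WHERE w IS NOT NULL LIMIT 10;

Because the CTE is evaluated lazily, the LIMIT stops the recursion after
the tenth distinct value.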

David



On Tue, 14 May 2024 at 02:07, Dimitrios Apostolou <jimis@gmx.net> wrote:
>
> On Tue, 14 May 2024, David Rowley wrote:
> > Parallel Append can also run in a way that the Append child nodes will
> > only get 1 worker each.
>
> How can I tell which case it is, from the EXPLAIN output (for example
> the output at [1]) ?

IIRC, the planner does prefer to use parallel-aware child Paths when
creating a Parallel Append.  Given equivalent costs, there's no
advantage to choosing a non-parallel-aware Path, as the planner has no
optimisations that such a choice would enable.  However, it is possible
that the planner *could* generate these: all the Append subpaths would
just have to be parallel safe but not parallel aware.  You could
identify them in EXPLAIN by seeing a "Parallel Append" without the
"Parallel" prefix on the node names of any of the Parallel Append's
subpaths.

David




On Fri, 10 May 2024, Tom Lane wrote:

> Dimitrios Apostolou <jimis@gmx.net> writes:
>> Further digging into this simple query, if I force the non-parallel plan
>> by setting max_parallel_workers_per_gather TO 0, I see that the query
>> planner comes up with a cost much higher:
>
>>   Limit  (cost=363.84..1134528847.47 rows=10 width=4)
>>     ->  Unique  (cost=363.84..22690570036.41 rows=200 width=4)
>>           ->  Append  (cost=363.84..22527480551.58 rows=65235793929 width=4)
>> ...
>
>> The total cost on the 1st line (cost=363.84..1134528847.47) has a much
>> higher upper limit than the total cost when
>> max_parallel_workers_per_gather is 4 (cost=853891608.79..853891608.99).
>> This explains the planner's choice. But I wonder why the cost estimation
>> is so far away from reality.
>
> I'd say the blame lies with that (probably-default) estimate of
> just 200 distinct rows.  That means the planner expects to have
> to read about 5% (10/200) of the tables to get the result, and
> that's making fast-start plans look bad.
>
> Possibly an explicit ANALYZE on the partitioned table would help.

It took long but it finished:

ANALYZE
Time: 19177398.025 ms (05:19:37.398)

And indeed it made a difference; the serial plan is chosen now:

EXPLAIN SELECT DISTINCT workitem_n FROM test_runs_raw ORDER BY workitem_n DESC LIMIT 10;
  Limit  (cost=364.29..1835512.29 rows=10 width=4)
    ->  Unique  (cost=364.29..22701882164.56 rows=123706 width=4)
          ->  Append  (cost=364.29..22538472401.60 rows=65363905182 width=4)
                ->  Index Only Scan Backward using test_runs_raw__part_max20000k_pkey on test_runs_raw__part_max20000k
test_runs_raw_1000 (cost=0.12..2.34 rows=1 width=4) 
                ->  Index Only Scan Backward using test_runs_raw__part_max19980k_pkey on test_runs_raw__part_max19980k
test_runs_raw_999 (cost=0.12..2.34 rows=1 width=4) 
                ->  Index Only Scan Backward using test_runs_raw__part_max19960k_pkey on test_runs_raw__part_max19960k
test_runs_raw_998 (cost=0.12..2.34 rows=1 width=4) 
[...]
                ->  Index Only Scan Backward using test_runs_raw__part_max12460k_pkey on test_runs_raw__part_max12460k
test_runs_raw_623 (cost=0.57..12329614.53 rows=121368496 width=4) 
                ->  Index Only Scan Backward using test_runs_raw__part_max12440k_pkey on test_runs_raw__part_max12440k
test_runs_raw_622 (cost=0.57..5180832.16 rows=184927264 width=4) 
                ->  Index Only Scan Backward using test_runs_raw__part_max12420k_pkey on test_runs_raw__part_max12420k
test_runs_raw_621 (cost=0.57..4544964.21 rows=82018824 width=4) 
[...]

Overall I think there are two issues that postgres could handle better
here:

1. Avoid the need for a manual ANALYZE on partitioned tables.

2. Create a different parallel plan, one that can exit early, when the
    LIMIT is proportionally low. I feel the partitions could be
    parallel-scanned in order, so that the whole thing stops as soon as
    the first (in sort order) non-empty partition has been read.

Thank you!
Dimitris




On Tue, 14 May 2024, Dimitrios Apostolou wrote:
>
> It took long but it finished:
>
> ANALYZE
> Time: 19177398.025 ms (05:19:37.398)

I see now that default_statistics_target is globally set to 500, so this
is probably the reason it took so long. I guess with the default of 100
it would take approximately one hour, which would be much more
manageable for a cron job. :-)
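
If I understand correctly, ANALYZE samples roughly 300 * statistics_target
rows per table, so another option might be to lower the target just for
this operation instead of globally (untested sketch):

SET default_statistics_target = 100;
ANALYZE test_runs_raw;
RESET default_statistics_target;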

Dimitris





On Tue, 14 May 2024, David Rowley wrote:
>
> If you were to put the n_distinct_inherited estimate back to 200 and
> disable sort, you should see the costs are higher for the index plan.
> If that's not the case then there might be a bug.  It seems more
> likely that, due to the n_distinct estimate being so low, the planner
> thought that a large enough fraction of the rows needed to be read, and
> that made the non-index plan appear cheaper.
>
> I'd be interested in seeing what the costs are for the index plan. I
> think the following will give you that (untested):
>
> alter table test_runs_raw alter column workitem_n set
> (n_distinct_inherited=200);
> analyze test_runs_raw;

I had to stop this step because it was taking too long going through all
partitions again. But it seems it had the desired effect.

> set enable_sort=0;
> explain SELECT DISTINCT workitem_n FROM test_runs_raw ORDER BY workitem_n DESC LIMIT 10;

It chooses the non-parallel index plan:

  Limit  (cost=365.17..1135517462.36 rows=10 width=4)
    ->  Unique  (cost=365.17..22710342308.83 rows=200 width=4)
          ->  Append  (cost=365.17..22546660655.46 rows=65472661350 width=4)
                ->  Index Only Scan Backward using test_runs_raw__part_max20000k_pkey on test_runs_raw__part_max20000k
test_runs_raw_1000 (cost=0.12..2.34 rows=1 width=4) 
                ->  Index Only Scan Backward using test_runs_raw__part_max19980k_pkey on test_runs_raw__part_max19980k
test_runs_raw_999 (cost=0.12..2.34 rows=1 width=4) 
[... only index scans follow]

LIMIT 100 goes for the parallel seqscan plan, which even contains a sort!
It seems to me that the extra steps raise the cost by an order of
magnitude, from 800M to 10G, in comparison to doing (Unique -> Sort) -
see the plan further down. (Most of that jump appears to be the
enable_sort=0 penalty, disable_cost = 1e10, added to the top Sort node,
rather than the extra upper-level HashAggregate step.)

  Limit  (cost=10857220388.76..10857220389.01 rows=100 width=4)
    ->  Sort  (cost=10857220388.76..10857220389.26 rows=200 width=4)
          Sort Key: test_runs_raw.workitem_n DESC
          ->  HashAggregate  (cost=857220379.12..857220381.12 rows=200 width=4)
                Group Key: test_runs_raw.workitem_n
                ->  Gather  (cost=857220295.12..857220377.12 rows=800 width=4)
                      Workers Planned: 4
                      ->  HashAggregate  (cost=857219295.12..857219297.12 rows=200 width=4)
                            Group Key: test_runs_raw.workitem_n
                            ->  Parallel Append  (cost=0.00..816295259.21 rows=16369614363 width=4)
                                  ->  Parallel Index Only Scan Backward using test_runs_raw__part_max9600k_pkey on
test_runs_raw__part_max9600k test_runs_raw_480  (cost=0.57..1597356.30 rows=33623360 width=4)
                                  ->  Parallel Index Only Scan Backward using test_runs_raw__part_max10140k_pkey on
test_runs_raw__part_max10140k test_runs_raw_507  (cost=0.57..1210806.37 rows=25794030 width=4)
                                  ->  Parallel Seq Scan on test_runs_raw__part_max9500k test_runs_raw_475
(cost=0.00..3037800.88 rows=64122388 width=4)
                                  ->  Parallel Seq Scan on test_runs_raw__part_max11180k test_runs_raw_559
(cost=0.00..2918865.36 rows=61611136 width=4)
[... only seqscans follow]



If I re-enable sort, then it goes for the parallel seqscan plan even with LIMIT 10:

SET SESSION enable_sort TO TRUE;
EXPLAIN  SELECT DISTINCT workitem_n FROM test_runs_raw ORDER BY workitem_n DESC LIMIT 10;

  Limit  (cost=857166256.39..857166256.59 rows=10 width=4)
    ->  Unique  (cost=857166256.39..857166260.39 rows=200 width=4)
          ->  Sort  (cost=857166256.39..857166258.39 rows=800 width=4)
                Sort Key: test_runs_raw.workitem_n DESC
                ->  Gather  (cost=857166135.82..857166217.82 rows=800 width=4)
                      Workers Planned: 4
                      ->  HashAggregate  (cost=857165135.82..857165137.82 rows=200 width=4)
                            Group Key: test_runs_raw.workitem_n
                            ->  Parallel Append  (cost=0.00..816243567.24 rows=16368627432 width=4)
                                  ->  Parallel Index Only Scan Backward using test_runs_raw__part_max9600k_pkey on
test_runs_raw__part_max9600k test_runs_raw_480  (cost=0.57..1597356.30 rows=33623360 width=4)
                                  ->  Parallel Index Only Scan Backward using test_runs_raw__part_max10140k_pkey on
test_runs_raw__part_max10140k test_runs_raw_507  (cost=0.57..1210806.37 rows=25794030 width=4)
                                  ->  Parallel Seq Scan on test_runs_raw__part_max9500k test_runs_raw_475
(cost=0.00..3037800.88 rows=64122388 width=4)
                                  ->  Parallel Seq Scan on test_runs_raw__part_max11180k test_runs_raw_559
(cost=0.00..2918865.36 rows=61611136 width=4)
[... only seqscans follow]

So, in order of lower to higher estimated cost, we have the following
alternatives:

1. parallel seqscan with sort  (~857M)
2. non-parallel index scan  (~1.1G)
3. parallel seqscan without sort, which actually still has a sort  (~10.8G,
    assuming it's the same as for LIMIT 100; most of that being the 1e10
    enable_sort penalty)

This also explains why re-enabling sort makes the planner pick the
parallel seqscan plan over the index plan, even with LIMIT 10.

>
> -- undo
> alter table test_runs_raw alter column workitem_n set (n_distinct_inherited=-1);

I believe I need to set it to 0 to be back to defaults.
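
Or, perhaps cleaner, RESET the attribute option so that the column truly
reverts to the defaults (untested):

ALTER TABLE test_runs_raw
    ALTER COLUMN workitem_n RESET (n_distinct_inherited);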

> reset enable_sort;
>


Regards,
Dimitris




I forgot to mention that I have had enable_partitionwise_aggregate=on
in the global settings since the beginning. According to the docs:

> Enables or disables the query planner's use of partitionwise grouping or
> aggregation, which allows grouping or aggregation on partitioned tables
> to be performed separately for each partition.

Reading that, I'd expect to see a separate DISTINCT->LIMIT 10 on every
partition, and then it would be up to independent plans to decide whether
each partition follows a parallel or a serial plan.

I'm not sure whether such a plan was considered but rejected because of
its cost.


Dimitris