Re: Re: fix cost subqueryscan wrong parallel cost - Mailing list pgsql-hackers

From Robert Haas
Subject Re: Re: fix cost subqueryscan wrong parallel cost
Date
Msg-id CA+TgmobxhwpZJDQgbtQOGe8+DfDnSsmf7JqPmtbitTK1vnRdpQ@mail.gmail.com
In response to Re: Re: fix cost subqueryscan wrong parallel cost  ("bucoo@sohu.com" <bucoo@sohu.com>)
List pgsql-hackers
On Fri, Apr 15, 2022 at 6:06 AM bucoo@sohu.com <bucoo@sohu.com> wrote:
> > Generally it should be. But there's no subquery scan visible here.
> I wrote a patch for distinct/union and aggregate support last year (I want to restart it).
> https://www.postgresql.org/message-id/2021091517250848215321%40sohu.com
> If this patch is not applied, some parallel paths will never be selected.

Sure, but that doesn't make the patch correct. The patch proposes
that, when parallelism is in use, a subquery scan will produce fewer rows
than when parallelism is not in use, and that's 100% false. Compare
this with the case of a parallel sequential scan. If a table contains
1000 rows, and we scan it with a regular Seq Scan, the Seq Scan will
return 1000 rows.  But if we scan it with a Parallel Seq Scan using
say 4 workers, the number of rows returned in each worker will be
substantially less than 1000, because 1000 is now the *total* number
of rows to be returned across *all* processes, and what we need is the
number of rows returned in *each* process.
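
To make the arithmetic concrete, here is a rough sketch of how a per-process row estimate could be derived from a table-wide total. The function names are hypothetical, and the leader-participation discount (the 0.3 factor) is an assumption used for illustration, not a statement of what the planner source does.

```python
def parallel_divisor(workers: int, leader_participates: bool = True) -> float:
    """Hypothetical helper: how many 'process equivalents' share the scan."""
    divisor = float(workers)
    if leader_participates:
        # Assumption: the leader contributes less as the worker count grows,
        # and contributes nothing once the discount reaches zero.
        leader_contribution = 1.0 - 0.3 * workers
        if leader_contribution > 0:
            divisor += leader_contribution
    return divisor


def per_process_rows(total_rows: float, workers: int) -> float:
    """Rows each process is expected to return from a parallel scan."""
    return total_rows / parallel_divisor(workers)


# A plain Seq Scan (no workers) returns the whole table in one process.
print(per_process_rows(1000, 0))
# With 4 workers, each process sees only a fraction of the 1000 rows.
print(per_process_rows(1000, 4))
```

Under these assumptions, 1000 stays the total across all processes, while the number each process returns shrinks as workers are added.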

The same thing isn't true for a subquery scan. Consider:

Gather
-> Subquery Scan
  -> Parallel Seq Scan

One thing is for sure: the number of rows that will be produced by the
subquery scan in each backend is exactly equal to the number of rows
that the subquery scan receives from its subpath. Parallel Seq Scan
can't just return a row count estimate based on the number of rows in
the table, because those rows are going to be divided among the
workers. But the Subquery Scan doesn't do anything like that. If it
receives let's say 250 rows as input in each worker, it's going to
produce 250 output rows in each worker. Your patch says it's going to
produce fewer than that, and that's wrong, regardless of whether it
gives you the plan you want in this particular case.
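
As a rough illustration of the plan shape above, the sketch below propagates row estimates up through Gather -> Subquery Scan -> Parallel Seq Scan. All names are hypothetical, the rows are assumed to split evenly across 4 processes, and the "double division" at the end is only my assumption about how an under-estimate could arise, not a description of the patch's code.

```python
def parallel_seq_scan_rows(table_rows: float, processes: int) -> float:
    """Per-process estimate: the table's rows are divided among processes."""
    return table_rows / processes


def subquery_scan_rows(input_rows: float) -> float:
    """A subquery scan with no filter emits exactly what its subpath feeds it."""
    return input_rows


def gather_rows(per_process_rows: float, processes: int) -> float:
    """Gather collects the per-process output back into one total."""
    return per_process_rows * processes


table_rows = 1000.0
processes = 4  # assumption: an even split, ignoring leader effects

scan = parallel_seq_scan_rows(table_rows, processes)  # rows per process
subq = subquery_scan_rows(scan)                       # same as its input
total = gather_rows(subq, processes)                  # back to the table total

# Dividing a second time at the subquery scan under-counts the rows,
# and Gather can then no longer reconstruct the real total.
double_divided = subq / processes
under_counted_total = gather_rows(double_divided, processes)

print(scan, subq, total, double_divided, under_counted_total)
```

The point of the sketch is that the division belongs at the node that actually splits the work; every node above it just inherits the already-divided count.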

-- 
Robert Haas
EDB: http://www.enterprisedb.com


