Re: Parallel INSERT (INTO ... SELECT ...) - Mailing list pgsql-hackers

From	Amit Kapila
Subject	Re: Parallel INSERT (INTO ... SELECT ...)
Date	October 9, 2020 12:04:38
Msg-id	CAA4eK1Ko8KFOXhK6CkgKXxNVA0uzJEpTbEcWOH3scfpu5GR=cw@mail.gmail.com
In response to	Re: Parallel INSERT (INTO ... SELECT ...) (Greg Nancarrow <gregn4422@gmail.com>)
List	pgsql-hackers

On Fri, Oct 9, 2020 at 4:28 PM Greg Nancarrow <gregn4422@gmail.com> wrote:
>
> On Fri, Oct 9, 2020 at 8:41 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Fri, Oct 9, 2020 at 2:37 PM Greg Nancarrow <gregn4422@gmail.com> wrote:
> > >
> > > Speaking of costing, I'm not sure I really agree with the current
> > > costing of a Gather node. Just considering a simple Parallel SeqScan
> > > case, the "run_cost += parallel_tuple_cost * path->path.rows;" part of
> > > Gather cost always completely drowns out any other path costs when a
> > > large number of rows are involved (at least with default
> > > parallel-related GUC values), such that Parallel SeqScan would never
> > > be the cheapest path. This linear relationship in the costing based on
> > > the rows and a parallel_tuple_cost doesn't make sense to me. Surely
> > > after a certain amount of rows, the overhead of launching workers will
> > > be outweighed by the benefit of their parallel work, such that the
> > > more rows, the more likely a Parallel SeqScan will benefit.
> > >
> >
> > That will be true for the number of rows/pages we need to scan, not for
> > the number of tuples we need to return as a result. The formula here
> > considers the number of rows the parallel scan will return: the more
> > rows each parallel node needs to pass via shared memory to the Gather
> > node, the more costly it will be.
> >
> > We do consider the total pages we need to scan in
> > compute_parallel_worker() where we use a logarithmic formula to
> > determine the number of workers.
> >
>
> Despite all the best intentions, the current costings seem to favor a
> non-parallel plan over a parallel plan more strongly the more rows
> there are in the table. Yet the performance of a parallel plan appears
> to be better than that of a non-parallel plan the more rows there are
> in the table.
> This doesn't seem right to me. Is there a rationale behind this costing model?
>

Yes. AFAIK, there is no proof that we gain much, if anything, by
dividing the I/O among workers; the benefit comes primarily from
dividing the CPU effort. So parallel plans show the greatest benefit
when we have to scan a large table and then project far fewer rows.
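
To make the costing discussed in this thread concrete, here is a rough
Python sketch. It is a simplification under stated assumptions, not the
planner code: the function names are made up for illustration, the
constants are the default GUC values, quals and other startup costs are
ignored, and the leader-contribution rule in parallel_divisor() is an
approximation of how the planner discounts the leader. Only the CPU cost
is divided among workers; the I/O cost is not.

```python
# Default GUC values (assumed unchanged from a stock configuration).
SEQ_PAGE_COST = 1.0
CPU_TUPLE_COST = 0.01
PARALLEL_TUPLE_COST = 0.1
PARALLEL_SETUP_COST = 1000.0
MIN_PARALLEL_TABLE_SCAN_PAGES = 1024  # 8MB with 8kB blocks


def workers_for_pages(heap_pages, max_workers=2):
    """Hypothetical helper: one more worker each time the table triples in size."""
    if heap_pages < MIN_PARALLEL_TABLE_SCAN_PAGES:
        return 0
    workers = 1
    threshold = max(MIN_PARALLEL_TABLE_SCAN_PAGES, 1)
    while heap_pages >= threshold * 3:
        workers += 1
        threshold *= 3
    return min(workers, max_workers)


def parallel_divisor(workers):
    """Workers plus a leader whose share shrinks as workers are added (assumption)."""
    leader = max(0.0, 1.0 - 0.3 * workers)
    return workers + leader


def seqscan_cost(pages, tuples):
    """Serial scan: read every page and process every tuple."""
    return SEQ_PAGE_COST * pages + CPU_TUPLE_COST * tuples


def gather_cost(pages, tuples, rows_returned, workers):
    """Parallel scan under Gather: I/O is not divided, CPU is."""
    scan = SEQ_PAGE_COST * pages + CPU_TUPLE_COST * tuples / parallel_divisor(workers)
    # Setup cost is paid once; every returned row pays the per-tuple transfer cost.
    return scan + PARALLEL_SETUP_COST + PARALLEL_TUPLE_COST * rows_returned


if __name__ == "__main__":
    pages, tuples = 100_000, 10_000_000
    workers = workers_for_pages(pages)
    print("workers:", workers)
    print("serial:", seqscan_cost(pages, tuples))
    print("parallel, all rows returned:", gather_cost(pages, tuples, tuples, workers))
    print("parallel, 1000 rows returned:", gather_cost(pages, tuples, 1_000, workers))
```

In this model, when every scanned row is returned, the
parallel_tuple_cost * rows term is larger than anything saved by dividing
the CPU cost, so the serial plan is estimated cheaper. When the same scan
returns only a small number of rows, the parallel plan is estimated
cheaper.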

-- 
With Regards,
Amit Kapila.
