Re: [HACKERS] Cost model for parallel CREATE INDEX - Mailing list pgsql-hackers

From: Robert Haas
Subject: Re: [HACKERS] Cost model for parallel CREATE INDEX
Date:
Msg-id: CA+TgmoakYL5wfcpg8bnQViNiqwUbZnSpRYkNPVXT9-fxtvRzJw@mail.gmail.com
In response to: Re: [HACKERS] Cost model for parallel CREATE INDEX (Peter Geoghegan <pg@bowt.ie>)
Responses: Re: [HACKERS] Cost model for parallel CREATE INDEX (Peter Geoghegan <pg@bowt.ie>)
List: pgsql-hackers
On Thu, Mar 2, 2017 at 10:38 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> I'm glad. This justifies the lack of much of any "veto" on the
> logarithmic scaling. The only things that can do that are
> max_parallel_workers_maintenance, the storage parameter
> parallel_workers (maybe this isn't a storage parameter in V9), and
> insufficient maintenance_work_mem per worker (as judged by
> min_parallel_relation_size being greater than workMem per worker).
>
> I guess that the workMem scaling threshold thing could be
> min_parallel_index_scan_size, rather than min_parallel_relation_size
> (which we now call min_parallel_table_scan_size)?

No, it should be based on min_parallel_table_scan_size, because that
is the size of the parallel heap scan that will be done as input to
the sort.
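
To make the shape of that concrete, here is a rough standalone sketch of
the kind of scaling we're talking about -- one more worker each time the
heap grows by some factor, gated by the thresholds mentioned above.  The
function name, the tripling factor, and the memory floor are illustrative
assumptions only, not what the patch actually does:

/*
 * Rough standalone sketch, not the actual patch: pick a worker count
 * logarithmically from the heap size, but only when the heap is at
 * least min_parallel_table_scan_size, and back off when each worker
 * would get too little maintenance_work_mem.  The tripling factor,
 * the memory floor, and all names here are illustrative assumptions.
 */
#include <stdio.h>

static int
sketch_index_build_workers(double heap_pages,
                           double min_table_scan_pages,
                           double maintenance_work_mem_kb,
                           double workmem_floor_kb,
                           int max_workers)
{
    int         workers = 0;
    double      threshold = min_table_scan_pages;

    /* Heap scan too small to bother parallelizing at all. */
    if (heap_pages < min_table_scan_pages)
        return 0;

    /* One more worker each time the heap size triples, up to the cap. */
    while (heap_pages >= threshold && workers < max_workers)
    {
        workers++;
        threshold *= 3;
    }

    /*
     * Veto workers that would be starved of memory: a stand-in for the
     * "workMem per worker" check discussed above.
     */
    while (workers > 0 &&
           maintenance_work_mem_kb / workers < workmem_floor_kb)
        workers--;

    return workers;
}

int
main(void)
{
    /* 10GB heap in 8kB pages, 8MB scan threshold, 64MB m_w_m,
     * 32MB-per-worker floor, cap of 8: all made-up inputs. */
    printf("workers = %d\n",
           sketch_index_build_workers(10.0 * 1024 * 1024 / 8,
                                      1024, 64 * 1024, 32 * 1024, 8));
    return 0;
}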

>> I think it's totally counter-intuitive that any hypothetical index
>> storage parameter would affect the degree of parallelism involved in
>> creating the index and also the degree of parallelism involved in
>> scanning it.  Whether or not other systems do such crazy things seems
>> to me to be beside the point.  I think if CREATE INDEX allows an explicit
>> specification of the degree of parallelism (a decision I would favor)
>> it should have a syntactically separate place for unsaved build
>> options vs. persistent storage parameters.
>
> I can see both sides of it.
>
> On the one hand, it's weird that you might have query performance
> adversely affected by what you thought was a storage parameter that
> only affected the index build. On the other hand, it's useful that you
> retain that as a parameter, because you may want to periodically
> REINDEX, or have a way of ensuring that pg_restore does go on to use
> parallelism, since it generally won't otherwise. (As mentioned
> already, pg_restore does not trust the cost model due to issues with
> the availability of statistics).

If you make the changes I'm proposing above, this parenthetical issue
goes away, because the only statistic you need is the table size,
which is what it is.  As to the rest, I think a bare REINDEX should
just use the cost model as if it were CREATE INDEX, and if you want to
override that behavior, you can do that by explicit syntax.  I see
very little utility for a setting that fixes the number of workers to
be used for future reindexes: there won't be many of them, and it's
kinda confusing.  But even if we decide to have that, I see no
justification at all for conflating it with the number of workers to
be used for a scan, which is something else altogether.

> To be clear, I don't have any strong feelings on all this. I just
> think it's worth pointing out that there are reasons to not do what
> you suggest, that you might want to consider if you haven't already.

I have considered them.  I also acknowledge that other people may view
the situation differently than I do.  I'm just telling you my opinion
on the topic.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


