Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation) - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
Date
Msg-id CAH2-WzmafkfBGJ3Euk6+qr8U5UcRFX9eiC9LaMSWTQ+YYC=oUA@mail.gmail.com
In response to Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On Wed, Jan 10, 2018 at 10:45 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> The addition to README.parallel is basically wrong, because workers
> have been allowed to write WAL since the ParallelContext machinery.
> See the
> XactLastRecEnd handling in parallel.c.  Workers can, for example, do
> HOT cleanups during SELECT scans, just as the leader can.  The
> language here is obsolete anyway in light of commit
> e9baa5e9fa147e00a2466ab2c40eb99c8a700824, but this isn't the right way
> to update it.  I'll propose a separate patch for that.

WFM.

> The change to the ParallelContext signature in parallel.h makes an
> already-overlength line even longer.  A line break seems warranted
> just after the first argument, plus pgindent afterward.

Okay.

> I am not a fan of the leader-as-worker terminology.  The leader is not
> a worker, full stop.  I think we should instead talk about whether the
> leader participates (so, ii_LeaderAsWorker -> ii_LeaderParticipates,
> for example, plus many comment updates).  Similarly, it seems
> SortCoordinateData's nLaunched should be nParticipants, and BTLeader's
> nworkertuplesorts should be nparticipanttuplesorts.

Okay.

> There is also the question of whether we want to respect
> parallel_leader_participation in this context.  The issues which might
> motivate the desire for such behavior in the context of a query do not
> exist when creating a btree index, so maybe we're just making this
> complicated. On the other hand, if some other type of parallel index
> build does end up doing a Gather-like operation then we might regret
> deciding that parallel_leader_participation doesn't apply to index
> builds, so maybe it's OK the way we have it.  On the third hand, the
> complexity of having the leader maybe-participate seems like it
> extends to a fair number of places in the code, and getting rid of all
> that complexity seems appealing.

I only added support for the leader-as-worker case because I assumed
that it was important to have CREATE INDEX process allocation work
"analogously" to parallel query, even though it's clear that the two
situations are not really completely comparable when you dig deep
enough. Getting rid of the leader participating as a worker has
theoretical downsides, but real practical upsides. I am also tempted
to just get rid of it.

> One place where this actually causes a problem is the message changes
> to index_build().  The revised ereport() violates translatability
> guidelines, which require that messages not be assembled from pieces.
> See https://www.postgresql.org/docs/devel/static/nls-programmer.html#NLS-GUIDELINES

Noted. Another place where a worker Tuplesortstate in the leader
process causes problems is plan_create_index_workers(), especially
because of things like force_parallel_mode and
parallel_leader_participation.

> A comment added to tuplesort.h says that work_mem should be at least
> 64KB, but does not give any reason.  I think one should be given, at
> least briefly, so that someone looking at these comments in the future
> can, for example, figure out whether the comment is still correct
> after future code changes.  Or else, remove the comment.

The reason for needing to do this is that a naive division of
work_mem/maintenance_work_mem within a caller like nbtsort.c could, in
general, result in a workMem as low as 0 (due to integer truncation of
the result of a division). Clearly *that* is too low. In
fact, we need at least enough memory to store the initial minimal
memtuples array, which needs to respect ALLOCSET_SEPARATE_THRESHOLD.
There is also the matter of having per-tape space for
TAPE_BUFFER_OVERHEAD when we spill to disk (note also the special case
for pass-by-value datum sorts low on memory). There have been a couple
of unavoidable OOM bugs in tuplesort over the years already.

How about I remove the comment, but have tuplesort_begin_common()
force each Tuplesortstate to have workMem that is at least 64KB
(minimum legal work_mem value) in all cases? We can just formalize the
existing assumption that workMem cannot go below 64KB, really, and it
isn't reasonable to use so little workMem within a parallel worker (it
should be prevented by plan_create_index_workers() in the real world,
where parallelism is never artificially forced).

There is no need to make this complicated by worrying about whether or
not 64KB is the true minimum (value that avoids "can't happen"
errors), IMV.
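
To make the truncation hazard concrete, here is a minimal,
self-contained sketch of the kind of clamp I have in mind (the
function and constant names are mine, for illustration -- not the
patch's actual code):

#include <stdio.h>

/* Floor matching the minimum legal work_mem value, in kilobytes */
#define MIN_SORT_MEM_KB 64

/*
 * Divide a maintenance_work_mem budget (in kB) among nparticipants,
 * clamping the result so that integer truncation can never leave a
 * Tuplesortstate with (nearly) no memory at all.
 */
static int
participant_sort_mem(int budget_kb, int nparticipants)
{
    int     workMem = budget_kb / nparticipants;

    return (workMem < MIN_SORT_MEM_KB) ? MIN_SORT_MEM_KB : workMem;
}

int
main(void)
{
    printf("%d\n", participant_sort_mem(256, 8));  /* 32kB, clamped to 64 */
    printf("%d\n", participant_sort_mem(30, 31));  /* 0kB, clamped to 64 */
    printf("%d\n", participant_sort_mem(4096, 4)); /* 1024kB -- no clamp */
    return 0;
}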

> + * Parallel sort callers are required to coordinate multiple tuplesort states
> + * in a leader process, and one or more worker processes.  The leader process
>
> I think the comma should be removed.  As written, it looks like we
> are coordinating multiple tuplesort states in a leader process, and,
> separately, we are coordinating one or more worker processes.

Okay.

> Generally, I think the comments in tuplesort.h are excellent.

Thanks.

> I really like the overview of how the new interfaces should be used,
> although I find it slightly wonky that the leader needs two separate
> Tuplesortstates if it wants to participate.

Assuming that we end up actually allowing the leader to participate as
a worker at all, then I think that having that be a separate
Tuplesortstate is better than the alternative. There are a couple of
places where I can see it mattering. For one thing, dtrace-compatible
traces become more complicated -- LogicalTapeSetBlocks() is reported
to dtrace within workers (though not via trace_sort logging, where it
is considered redundant). For another, I think we'd need to have
multiple tapesets at the same time for the leader if it only had one
Tuplesortstate, which means multiple new Tuplesortstate fields.

In short, having a distinct Tuplesortstate means almost no special
cases. Maybe you find it slightly wonky because parallel CREATE INDEX
really does have the leader participate as a worker with minimal
caveats. It will do just as much work as a real parallel worker
process, which really is quite a new thing, in a way.

> I don't understand why this patch needs to tinker with the tests in
> vacuum.sql.  The comments say that "If we did not do this, errors
> raised would concern running ANALYZE in parallel mode."  However, why
> should parallel CREATE INDEX having any impact on ANALYZE at all?
> Also, as a practical matter, if I revert those changes, 'make check'
> still passes with or without force_parallel_mode=on.

This certainly wasn't true before now -- parallel CREATE INDEX could
previously cause the test to give different output for one error
message. I'll revert that change.

I imagine (though haven't verified) that this happened because, as you
pointed out separately, I didn't get the memo about e9baa5e9 (this is
the commit you mentioned in relation to README.parallel/parallel write
DML).

> I really dislike the fact that this patch invents another thing for
> force_parallel_mode to do.  I invented force_parallel_mode mostly as a
> way of testing that functions were correctly labeled for
> parallel-safety, and I think it would be just fine if it never does
> anything else.

This is not something that I feel strongly about, though I think it is
useful to test parallel CREATE INDEX in low memory conditions, one way
or another.

> I don't think it will be a
> good thing for PostgreSQL if we end up with force_parallel_mode=on as
> a general "use parallelism even though it's stupid" flag, requiring
> supporting code in many different places throughout the code base and
> a laundry list of not-actually-useful behavior changes in the
> documentation.

I will admit that "use parallelism even though it's stupid" is how I
thought of force_parallel_mode=on. I thought of it as a testing option
that users shouldn't need to concern themselves with in almost all
cases. I am not at all attached to what I did with
force_parallel_mode, except that it provides some way to test low
memory conditions, and it was something that I thought you'd expect
from this patch.

> What I think would be a lot more useful, and what I sort of expected
> the patch to have, is a way for a user to explicitly control the
> number of workers requested for a CREATE INDEX operation.

I tend to agree. It wouldn't be *very* compelling, because there
doesn't seem to be much sensitivity to how many workers are used
anyway, but it's worth having.

> We all know
> that the cost model is crude and that may be OK -- though it would be
> interesting to see some research on what the run times actually look
> like for various numbers of workers at various table sizes and
> work_mem settings -- but it will be inconvenient for DBAs who actually
> know what number of workers they want to use to instead get whatever
> value plan_create_index_workers() decides to emit.

I did a lot of unpublished research on this over a year ago, and
noticed nothing strange then. I guess I could revisit it using the box
that Postgres Pro provided me with access to.

> They can force it
> by setting the parallel_workers reloption, but that affects queries.
> They can probably also do it by setting min_parallel_table_scan_size =
> 0 and max_parallel_workers_maintenance to whatever value they want,
> but I think it would be convenient for there to be a more
> straightforward way to do it, or at least some documentation in the
> CREATE INDEX page about how to get the number of workers you really
> want.  To be clear, I don't think that this is a must-fix issue for
> this patch to get committed, but I do think that all reference to
> force_parallel_mode=on should go away.

The only reason I didn't add a "just use this many parallel workers"
option myself already is that doing so introduces awkward ambiguities.
Long ago, an earlier version of the patch added a parallel_workers
index storage param, which you didn't like because it confused the
issue in just
the same way as the table parallel_workers storage param does now,
would have confused parallel index scan, and so on. I counter-argued
that though this was ugly, it seemed to be how it worked on other
systems (more of an explanation than an argument, actually, because I
find it hard to know what to do here).

You're right that there should be a way to simply force the number of
parallel workers for DDL commands that use parallelism. You're also
right that it shouldn't be a storage parameter (index or otherwise),
because a storage parameter modifies run-time behavior in a surprising
way (even if this pitfall *is* actually something that users of SQL
Server and Oracle have to live with).  Adding something to the CREATE
INDEX grammar just for this *also* seems confusing, because users will
think that it is a storage parameter even though it isn't (I'm pretty
sure that almost no Postgres user can give you a definition of a
storage parameter without some prompting).

I share your general feelings on all of this, but I really don't know
what to do about it. Which of these alternatives is the least worst,
all things considered?
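
For reference, the GUC-based workaround you mention would look
something like this (table and index names are made up, and I'm
assuming the patch's current spelling of the GUC):

SET min_parallel_table_scan_size = 0;  -- lift the size-based limit
SET max_parallel_workers_maintenance = 4;  -- with the line above,
                                           -- effectively the worker count
CREATE INDEX idx_foo_bar ON foo (bar);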

> I do not like the way that this patch wants to turn the section of the
> documentation on when parallel query can be used into a discussion of
> when parallelism can be used.  I think it would be better to leave
> that section alone and instead document under CREATE INDEX the
> concerns specific to parallel index build. I think this will be easier
> for users to understand and far easier to maintain as the number of
> parallel DDL operations increases, which I expect it to do somewhat
> explosively.

WFM.

> The documentation for max_parallel_workers_maintenance cribs from the
> documentation for max_parallel_workers_per_gather in saying that we'll
> use fewer workers than expected "which may be inefficient".  However,
> for parallel CREATE INDEX, that trailing clause is, at least as far as
> I can see, not applicable.

Fair point. Will revise.

> (Various points on phrasing and punctuation)

That all seems fine.

> + * To support parallel sort operations involving coordinated callers to
> + * tuplesort.c routines across multiple workers, it is necessary to
> + * concatenate each worker BufFile/tapeset into one single leader-wise
> + * logical tapeset.  Workers should have produced one final materialized
> + * tape (their entire output) when this happens in leader; there will always
> + * be the same number of runs as input tapes, and the same number of input
> + * tapes as workers.
>
> I can't interpret the word "leader-wise".  A partition-wise join is a
> join done one partition at a time, but a leader-wise logical tape set
> is not done one leader at a time.  If there's another meaning to the
> affix -wise, I'm not familiar with it.  Don't we just mean "a single
> logical tapeset managed by the leader"?

Yes, we do. Will change.

> There's a lot here I haven't grokked yet, but I'm running out of
> mental energy so I think I'll send this for now and work on this some
> more when time permits, hopefully tomorrow.

The good news is that the things that you took issue with were about
what I expected you to take issue with. You seem to be getting through
the review of this patch very efficiently.

-- 
Peter Geoghegan

