On Mon, Aug 6, 2018 at 1:37 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> To be clear, I mean that the leader process's worker state has the
> right relfilenode (the leader process always participates as a
> worker), but all worker processes have the stale relfilenode.
Sure enough, that's what the bug is - a few debugging calls to
RelationMapFilenodeToOid() within nbtsort.c proves it. Several
approaches to fixing the bug occur to me:
* Ban parallel CREATE INDEX for all catalogs. This was how things were
up until several weeks before the original patch was committed.
* Ban parallel CREATE INDEX for mapped catalogs only.
* Find a way to propagate the state necessary to have parallel workers
agree with the leader on the correct relfilenode.
We could probably propagate backend-local state like
active_local_updates without too much difficulty, which looks like it
would fix the problem. Note that we did something very similar with
reindex-pending-indexes lists in commit 29d58fd3. That commit
similarly involved propagating more backend-local state so that
parallel index builds (or at least REINDEX) on catalogs could be
enabled/work reliably. Maybe we should continue down the road of
making parallel builds work on catalogs, on general principle.
Thoughts?
--
Peter Geoghegan