From: Peter Geoghegan
Subject: Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
Msg-id: CAH2-Wz=__=SHiQ96bg=sLOeTdW6xbdgtNVeNW5nuyHPCZvB=-g@mail.gmail.com
In response to: Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation) (Amit Kapila <amit.kapila16@gmail.com>)
List: pgsql-hackers
On Sat, Jan 13, 2018 at 4:32 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Yeah, but this would mean that now with parallel create index, it is
> possible that some tuples from the transaction would end up in index
> and others won't.

You mean tuples deleted by some past transaction that committed, but
only after someone had acquired a still-held snapshot that therefore
doesn't see the deleter's transaction as committed?

I guess that is a different case, but it doesn't matter. All that
matters is that in the end, the index contains all entries for all
heap tuples visible to any possible snapshot (though possibly
excluding some existing old snapshots iff we detect broken HOT chains
during builds).

> In general, this makes me slightly nervous mainly
> because such a case won't be possible without the parallel option for
> create index, but if you and Robert are okay with it as there is no
> fundamental problem, then we might as well leave it as it is or maybe
> add a comment saying so.

Let me try to explain this another way, in terms of the high-level
intuition that I have about it (Robert can probably skip this part).

GetOldestXmin() returns a value that is inherently a *conservative*
cut-off. In hot standby mode, it's possible for the value it returns
to go backwards from a value previously returned within the same
backend.
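To make the non-monotonicity concrete, here's a minimal sketch (not
patch code; it assumes the current GetOldestXmin(Relation, int flags)
signature with PROCARRAY_FLAGS_VACUUM):

/* Two back-to-back calls in the same backend.  Each returned value
 * is independently a safe cut-off, but on a hot standby the second
 * value may precede the first.  Callers can rely on conservatism,
 * never on monotonicity. */
TransactionId xmin1 = GetOldestXmin(heapRelation, PROCARRAY_FLAGS_VACUUM);
TransactionId xmin2 = GetOldestXmin(heapRelation, PROCARRAY_FLAGS_VACUUM);
/* Asserting TransactionIdPrecedesOrEquals(xmin1, xmin2) here would
 * be wrong -- it can fail in hot standby mode. */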

Even with serial builds, the exact instant that GetOldestXmin() gets
called can vary with something as incidental as OS scheduling of the
process that runs CREATE INDEX, so the cut-off could differ on that
basis alone. It follows that it doesn't matter if parallel CREATE
INDEX participants end up with slightly different cut-off values: the
cut-off is all about the consistency of the index with what the
universe of possible snapshots could see in the heap, not the
consistency of different parts of the index (the parts produced from
heap tuples read by each participant) with each other.

Look at how the pg_visibility module uses GetOldestXmin() to recheck:
it calls GetOldestXmin() a second time, with a buffer lock held on the
heap page throughout. It does this to conclusively establish that the
visibility map is corrupt (otherwise, it could just be that the
cut-off became stale).
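The recheck pattern looks roughly like this (a paraphrased sketch, not
the actual contrib code; tuple_all_visible() and record_corrupt_item()
are stand-ins for the real visibility test and corrupt-item
reporting):

/* Inside the per-tuple loop; OldestXmin was computed earlier,
 * before the buffer lock was taken. */
LockBuffer(buf, BUFFER_LOCK_SHARE);
if (!tuple_all_visible(tuple, OldestXmin))
{
    /*
     * Not necessarily corruption: the cut-off may simply have become
     * stale.  Recompute it with the buffer lock still held; only if
     * the tuple also fails against the fresh cut-off is the
     * all-visible bit conclusively wrong.
     */
    TransactionId Recomputed = GetOldestXmin(NULL, PROCARRAY_FLAGS_VACUUM);

    if (!tuple_all_visible(tuple, Recomputed))
        record_corrupt_item(items, &tuple->t_self);
}
LockBuffer(buf, BUFFER_LOCK_UNLOCK);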

Putting all of this together, it would be safe for the
HEAPTUPLE_RECENTLY_DEAD case within IndexBuildHeapRangeScan() to call
GetOldestXmin() again (a bit like pg_visibility does), to avoid having
to index a tuple that is by now fully dead: we could call
HeapTupleSatisfiesVacuum() a second time for the heap tuple, hoping to
get HEAPTUPLE_DEAD the second time around. This optimization wouldn't
pay off much of the time (only when an old snapshot went away during
the CREATE INDEX), and it would add ProcArrayLock traffic, so we don't
do it. But AFAICT it's feasible.
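As a sketch, the unimplemented idea would look something like this
(not patch code; variable names are borrowed from the
IndexBuildHeapRangeScan() scan loop, and the real case is more
involved because of HOT chain handling):

case HEAPTUPLE_RECENTLY_DEAD:
    /*
     * The deleting transaction committed, but an old snapshot that
     * can still see the tuple may remain.  Recompute the cut-off,
     * pg_visibility-style, and retest: if every such snapshot is
     * gone by now, the tuple is HEAPTUPLE_DEAD and needn't be
     * indexed after all.
     */
    OldestXmin = GetOldestXmin(heapRelation, PROCARRAY_FLAGS_VACUUM);
    if (HeapTupleSatisfiesVacuum(heapTuple, OldestXmin, buffer) ==
        HEAPTUPLE_DEAD)
    {
        indexIt = false;    /* fully dead by now; skip it */
        break;
    }
    indexIt = true;         /* still conservatively index it */
    break;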

> Another point is that the information about broken hot chains
> indexInfo->ii_BrokenHotChain is getting lost.  I think you need to
> coordinate this information among backends that participate in
> parallel create index.  Test to reproduce the problem is as below:
>
> create table tbrokenchain(c1 int, c2 varchar);
> insert into tbrokenchain values(3, 'aaa');
>
> begin;
> set force_parallel_mode=on;
> update tbrokenchain set c2 = 'bbb' where c1=3;
> create index idx_tbrokenchain on tbrokenchain(c1);
> commit;
>
> Now, check the value of indcheckxmin in pg_index, it should be true,
> but with the patch it is false. You can also try the patch without
> changing the value of force_parallel_mode.

Ugh, you're right. That's a real howler. Will fix.
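
The fix will probably have roughly this shape (a sketch only; the
shared-state struct and field names here are placeholders, not
necessarily what the next revision will use): each worker publishes
whether its portion of the scan saw a broken HOT chain, and the leader
ORs everything back into its own IndexInfo before the pg_index entry
is written.

/* In each worker, after its portion of the heap scan: */
SpinLockAcquire(&btshared->mutex);
btshared->brokenhotchain |= indexInfo->ii_BrokenHotChain;
SpinLockRelease(&btshared->mutex);

/* In the leader, once all workers have finished: */
indexInfo->ii_BrokenHotChain |= btshared->brokenhotchain;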

Note that my stress-testing strategy has had a lot to do with
verifying that serial builds produce relfiles that are physically
identical to those of parallel builds. Obviously that couldn't have
caught this bug, because it only concerns the state of the pg_index
catalog.

> The patch uses both parallel_leader_participation and
> force_parallel_mode, but it seems the definition is different from
> what we have in Gather.  Basically, even with force_parallel_mode, the
> leader is participating in parallel build. I see there is some
> discussion above about both these parameters and still, there is not
> complete agreement on the best way forward.  I think we should have
> parallel_leader_participation as that can help in testing if nothing
> else.

I think that you're quite right that parallel_leader_participation
needs to be supported for testing purposes. I had some sympathy for
the idea that we should remove leader participation as a worker from
the patch entirely, but the testing argument seems to clinch it. I'm
fine with killing force_parallel_mode, though, because it will be
possible to force the use of parallelism by using the existing
parallel_workers table storage param in the next version of the patch,
regardless of how small the table is.

Thanks for the review.
-- 
Peter Geoghegan

