Thread: Parallel tuplesort (for parallel B-Tree index creation)
As some of you know, I've been working on parallel sort. I think I've gone as long as I can without feedback on the design (and I see that we're accepting stuff for September CF now), so I'd like to share what I came up with. This project is something that I've worked on inconsistently since late last year. It can be thought of as the Postgres 10 follow-up to the 9.6 work on external sorting.

Attached WIP patch series:

* Adds a parallel sorting capability to tuplesort.c.

* Adds a new client of this capability: btbuild()/nbtsort.c can now create B-Trees in parallel.

Most of the complexity here relates to the first item; the tuplesort module has been extended to support sorting in parallel. This is usable in principle by every existing tuplesort caller, without any restriction imposed by the newly expanded tuplesort.h interface. So, for example, randomAccess MinimalTuple support has been added, although it goes unused for now.

I went with CREATE INDEX as the first client of parallel sort in part because the cost model and so on can be relatively straightforward. Even CLUSTER uses the optimizer to determine if a sort strategy is appropriate, and that would need to be taught about parallelism if its tuplesort is to be parallelized. I suppose that I'll probably try to get CLUSTER (with a tuplesort) done in the Postgres 10 development cycle too, but not just yet. For now, I would prefer to focus discussion on tuplesort itself. If you can only look at one part of this patch, please look at the high-level description of the interface/caller contract that was added to tuplesort.h.

Performance
===========

Without further ado, I'll demonstrate how the patch series improves performance in one case. This benchmark was run on an AWS server with many disks. A d2.4xlarge instance was used, with 16 vCPUs, 122 GiB RAM, 12 x 2 TB HDDs, running Amazon Linux. Apparently, this AWS instance type can sustain 1,750 MB/second of I/O, which I was able to verify during testing (when a parallel sequential scan ran, iotop reported read throughput slightly above that for multi-second bursts). Disks were configured in software RAID0. These instances have disks that are optimized for sequential performance, which suits the patch quite well. I don't usually trust AWS EC2 for performance testing, but it seemed to work well here (results were pretty consistent).

Setup:

CREATE TABLE parallel_sort_test AS
    SELECT hashint8(i) randint,
           md5(i::text) collate "C" padding1,
           md5(i::text || '2') collate "C" padding2
    FROM generate_series(0, 1e9::bigint) i;

CHECKPOINT;

This leaves us with a parallel_sort_test table that is 94 GB in size.

SET maintenance_work_mem = '8GB';

-- Serial case (external sort, should closely match master branch):
CREATE INDEX serial_idx ON parallel_sort_test (randint) WITH (parallel_workers = 0);

Total time: 00:15:42.15

-- Patch with 8 tuplesort "sort-and-scan" workers (leader process participates as a worker here):
CREATE INDEX patch_8_idx ON parallel_sort_test (randint) WITH (parallel_workers = 7);

Total time: 00:06:03.86

As you can see, the parallel case is 2.58x faster (while using more memory, though it's not the case that a higher maintenance_work_mem setting speeds up the serial/baseline index build). 8 workers are a bit faster than 4, but not by much (not shown). 16 are a bit slower, but not by much (not shown).
trace_sort output for "serial_idx" case:

"""
begin index sort: unique = f, workMem = 8388608, randomAccess = f
switching to external sort with 501 tapes: CPU 7.81s/25.54u sec elapsed 33.95 sec
*** SNIP ***
performsort done (except 7-way final merge): CPU 53.52s/666.89u sec elapsed 731.67 sec
external sort ended, 2443786 disk blocks used: CPU 74.40s/854.52u sec elapsed 942.15 sec
"""

trace_sort output for "patch_8_idx" case:

"""
begin index sort: unique = f, workMem = 8388608, randomAccess = f
*** SNIP ***
sized memtuples 1.62x from worker's 130254158 (3052832 KB) to 210895910 (4942873 KB) for leader merge (0 KB batch memory conserved)
*** SNIP ***
tape -1/7 initially used 411907 KB of 430693 KB batch (0.956) and 26361986 out of 26361987 slots (1.000)
performsort done (except 8-way final merge): CPU 12.28s/101.76u sec elapsed 129.01 sec
parallel external sort ended, 2443805 disk blocks used: CPU 30.08s/318.15u sec elapsed 363.86 sec
"""

This is roughly the degree of improvement that I expected when I first undertook this project late last year. As I go into in more detail below, I believe that we haven't exhausted all avenues to make parallel CREATE INDEX faster still, but I do think that what's left on the table is not enormous.

There is less benefit when sorting on a C locale text attribute, because the overhead of merging dominates parallel sorts, and that's even more pronounced with text. So, many text cases tend to work out at only about 2x - 2.2x faster. We could work on this indirectly. I've seen cases where a CREATE INDEX ended up more than 3x faster, though. I benchmarked this case in the interest of simplicity (the serial case is intended to be comparable, making the test fair). Encouragingly, as you can see from the trace_sort output, the 8 parallel workers are 5.67x faster at getting to the final merge (a merge that is itself performed serially). Note that the final merge for each CREATE INDEX is comparable (7 runs vs. 8 runs from each of 8 workers). Not bad!

Design: New, key concepts for tuplesort.c
=========================================

The heap is scanned in parallel, and worker processes also merge in parallel if required (it isn't required in the example above). The implementation makes heavy use of existing external sort infrastructure. In fact, it's almost the case that the implementation is a generalization of external sorting that allows workers to perform heap scanning and run sorting independently, with tapes then "unified" in the leader process for merging. At that point, the state held by the leader is more or less consistent with the leader being a serial external sort process that has reached its merge phase in the conventional manner (serially).

The steps callers must take are described fully in tuplesort.h. The general idea is that a Tuplesortstate is aware that it might not be a self-contained sort; it may instead be one part of a parallel sort operation. You might say that the tuplesort caller must "build its own sort" from participant worker process Tuplesortstates. The caller creates a dynamic shared memory segment + TOC for each parallel sort operation (there could be more than one concurrent sort operation, of course), passes that to tuplesort to initialize and manage, and creates a "leader" Tuplesortstate in private memory, plus one or more "worker" Tuplesortstates, each presumably managed by a different parallel worker process. tuplesort.c does most of the heavy lifting, including having processes wait on each other to respect its ordering dependencies. The caller is responsible for spawning workers to do the work, reporting details of the workers to tuplesort through shared memory, and having workers call tuplesort to actually perform sorting. The caller consumes the final output through the leader Tuplesortstate in the leader process. I think that this division of labor works well for us.
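To make the contract a little more concrete, here is a heavily simplified pseudo-C sketch of the caller-side flow. To be clear, the identifiers here are illustrative placeholders only, not the patch's actual interface (see tuplesort.h in the patch for that); error handling and the internal wait/ordering protocol, which tuplesort.c manages itself, are omitted:

/*
 * Pseudo-C sketch of the caller/tuplesort division of labor for one
 * parallel sort.  All names here are placeholders, not the real API.
 */

/* Leader, before launching workers: */
seg = dsm_create(...);                     /* DSM segment + TOC, sized per nworkers */
shared = parallel_sort_shared_init(seg, nworkers);  /* handed to tuplesort to manage */
LaunchParallelWorkers(...);                /* caller's job, never tuplesort's */

/* Each worker process: */
wstate = tuplesort_begin_xxx(..., shared, myworkernumber);  /* "worker" Tuplesortstate */
while ((tup = parallel_heap_scan_next(...)) != NULL)
    tuplesort_puttuple(wstate, tup);       /* scan-and-sort, much as in the serial case */
tuplesort_performsort(wstate);             /* sort; leave one materialized run on tape */
tuplesort_end(wstate);

/* Leader, once all launched workers are known to have finished: */
lstate = tuplesort_begin_xxx(..., shared, -1);  /* "leader" Tuplesortstate */
tuplesort_performsort(lstate);             /* unify worker tapes; ready the merge */
while ((tup = tuplesort_gettuple(lstate)) != NULL)
    consume(tup);                          /* e.g. nbtsort.c writes out the index */
tuplesort_end(lstate);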
Tape unification
----------------

Sort operations have a unique identifier, generated before any workers are launched, using a scheme based on the leader's PID and a unique temp file number. This makes all on-disk state (temp files managed by logtape.c) discoverable by the leader process. State in shared memory is sized in proportion to the number of workers, so the only thing about the data being sorted that gets passed around in shared memory is a little logtape.c metadata for tapes, describing for example how large each constituent BufFile is (a BufFile associated with one particular worker's tapeset). (See below also for notes on buffile.c's role in all of this, fd.c and resource management, etc.)

workMem
-------

Each worker process claims workMem as if it were an independent node. The new implementation reuses much of what was originally designed for external sorts. As such, parallel sorts are necessarily external sorts, even when the workMem (i.e. maintenance_work_mem) budget could in principle allow for parallel sorting to take place entirely in memory. The implementation arguably *insists* on making such cases external sorts, when they don't really need to be. This is much less of a problem than you might think, since the 9.6 work on external sorting does somewhat blur the distinction between internal and external sorts (just consider how much time trace_sort indicates is spent waiting on writes in workers; it's typically a small part of the total time spent). Since parallel sort is really only compelling for large sorts, it makes sense to make them external, or at least to prioritize the cases that should be performed externally.

Anyway, workMem-not-exceeded cases require special handling to not completely waste memory. Statistics about worker observations are used at later stages, to at least avoid blatant waste, and to ensure that memory is used optimally more generally.

Merging
=======

The model that I've come up with is that every worker process is guaranteed to output one materialized run onto one tape for the leader to merge, within its "unified" tapeset. This is the case regardless of how much workMem is available, or any other factor. The leader always assumes that the worker runs/tapes are present and discoverable based only on the number of known-launched worker processes, and a little metadata on each that is passed through shared memory.

Producing one output run/materialized tape from all input tuples in a worker often happens without the worker running out of workMem, which you saw above. A straight quicksort and dump of all tuples is therefore possible, without any merging required in the worker. Alternatively, it may prove necessary to do some amount of merging in each worker to generate one materialized output run. This case is handled in the same way as a randomAccess case that requires one materialized output tape to support random access by the caller. This worker merging does necessitate another pass over all temp files for the worker, but that's a much lower cost than you might imagine, in part because the newly expanded use of batch memory makes merging here cache efficient.
Batch allocation is used for all merging involved here, not just the leader's own final on-the-fly merge, so merging is consistently cache efficient. (Workers that must merge on their own are therefore similar to traditional randomAccess callers, so these cases become important enough to optimize with the batch memory patch, although that's still independently useful.)

No merging in parallel
----------------------

Currently, merging worker *output* runs may only occur in the leader process. In other words, we always keep n worker processes busy with scanning-and-sorting (and maybe some merging), but then all processes but the leader process grind to a halt (note that the leader process can participate as a scan-and-sort tuplesort worker, just as it will everywhere else, which is why I specified "parallel_workers = 7" but talked about 8 workers). One leader process is kept busy with merging these n output runs on the fly, so things will bottleneck on that, which you saw in the example above. As already described, workers will sometimes merge in parallel, but only their own runs -- never another worker's runs.

I did attempt to address the leader merge bottleneck by implementing cross-worker run merging in workers. I got as far as implementing a very rough version of this, but initial results were disappointing, and so that was not pursued further than the experimentation stage. Parallel merging is a possible future improvement that could be added to what I've come up with, but I don't think that it will move the needle in a really noticeable way.

Partitioning for parallelism (samplesort style "bucketing")
-----------------------------------------------------------

Perhaps a partition-based approach would be more effective than parallel merging (e.g., redistribute slices of worker runs across workers along predetermined partition boundaries, sort a range of values within dedicated workers, then concatenate to get the final result, a bit like the in-memory samplesort algorithm). That approach would not suit CREATE INDEX, because the approach's great strength is that the workers can run in parallel for the entire duration, since there is no merge bottleneck (this assumes good partition boundaries, which is a bit of a risky assumption). Parallel CREATE INDEX wants something where the workers can independently write the index, and independently WAL log, and independently create a unified set of internal pages, all of which is hard.

This patch series will tend to proportionally speed up CREATE INDEX statements at a level that is comparable to other major database systems. That's enough progress for one release. I think that partitioning to sort is more useful for query execution than for utility statements like CREATE INDEX.

Partitioning and merge joins
----------------------------

Robert has often speculated about what it would take to make merge joins work well in parallel. I think that "range distribution"/bucketing will prove an important component of that. It's just too useful to aggregate tuples in shared memory initially, and have workers sort them without any serial merge bottleneck; arguments about misestimations, data skew, and so on should not deter us from this, long term. This approach has minimal IPC overhead, especially with regard to LWLock contention. This kind of redistribution probably belongs in a Gather-like node, though, which has access to the context necessary to determine a range, and even dynamically alter the range in the event of a misestimation.
Under this scheme, tuplesort.c just needs to be instructed that these worker-private Tuplesortstates are range-partitioned (i.e., the sorts are virtually independent, as far as it's concerned). That's a bit messy, but it is still probably the way to go for merge joins and other sort-reliant executor nodes.

buffile.c, and "unification"
============================

There has been significant new infrastructure added to make logtape.c aware of workers. buffile.c has in turn been taught about unification as a first class part of the abstraction, with low-level management of certain details occurring within fd.c. So, "tape unification", where processes open other backends' logical tapes to generate a unified logical tapeset for the leader to merge, is added. This is probably the single biggest source of complexity for the patch, since I must consider:

* Creating a general, reusable abstraction for other possible BufFile users (logtape.c only has to serve tuplesort.c, though).

* Logical tape free space management.

* Resource management, file lifetime, etc. fd.c resource management can now close a file at xact end for temp files, while not deleting it in the leader backend (only the "owning" worker backend deletes the temp file it owns).

* Crash safety (e.g., when to truncate existing temp files, and when not to).

CREATE INDEX user interface
===========================

There are two ways of determining how many parallel workers a CREATE INDEX requests:

* A cost model, which is closely based on create_plain_partial_paths() at the moment. This needs more work, particularly to model things like maintenance_work_mem. Even still, it isn't terrible.

* A parallel_workers storage parameter, which completely bypasses the cost model. This is the "DBA knows best" approach, and is what I've consistently used during testing. Corey Huinker has privately assisted me with performance testing the patch, using his own datasets. Testing has exclusively used the storage parameter.

I've added a new GUC, max_parallel_workers_maintenance, which is essentially the utility statement equivalent of max_parallel_workers_per_gather. This is clearly necessary, since we're using up to maintenance_work_mem per worker, which is of course typically much higher than work_mem. I didn't feel the need to create a new maintenance-wise variant GUC for things like min_parallel_relation_size, though. Only this one new GUC is added (plus the new storage parameter, parallel_workers, not to be confused with the existing table storage parameter of the same name).

I am much more concerned about the tuplesort.h interface than the CREATE INDEX user interface as such. The user interface is merely a facade on top of tuplesort.c and nbtsort.c (and not one that I'm particularly attached to).

--
Peter Geoghegan
On Mon, Aug 1, 2016 at 6:18 PM, Peter Geoghegan <pg@heroku.com> wrote:
> As some of you know, I've been working on parallel sort. I think I've
> gone as long as I can without feedback on the design (and I see that
> we're accepting stuff for September CF now), so I'd like to share
> what I came up with. This project is something that I've worked on
> inconsistently since late last year. It can be thought of as the
> Postgres 10 follow-up to the 9.6 work on external sorting.

I am glad that you are working on this. Just a first thought after reading the email:

> As you can see, the parallel case is 2.58x faster (while using more
> memory, though it's not the case that a higher maintenance_work_mem
> setting speeds up the serial/baseline index build). 8 workers are a
> bit faster than 4, but not by much (not shown). 16 are a bit slower,
> but not by much (not shown).
...
> I've seen cases where a CREATE INDEX ended up more than 3x faster,
> though. I benchmarked this case in the interest of simplicity (the
> serial case is intended to be comparable, making the test fair).
> Encouragingly, as you can see from the trace_sort output, the 8
> parallel workers are 5.67x faster at getting to the final merge (a
> merge that is itself performed serially). Note that the final merge
> for each CREATE INDEX is comparable (7 runs vs. 8 runs from each of
> 8 workers). Not bad!

I'm not going to say it's bad to be able to do things 2-2.5x faster, but linear scalability this ain't - particularly because your 2.58x faster case is using up to 7 or 8 times as much memory. The single-process case would be faster in that case, too: you could quicksort. I feel like for sorting, in particular, we probably ought to be setting the total memory budget, not the per-process memory budget. Or if not, then any CREATE INDEX benchmarking had better compare using scaled values for maintenance_work_mem; otherwise, you're measuring the impact of using more memory as much as anything else.

I also think that Amdahl's law is going to pinch pretty severely here. If the final merge phase is a significant percentage of the total runtime, picking an algorithm that can't parallelize the final merge is going to limit the speedups to small multiples. That's an OK place to be as a result of not having done all the work yet, but you don't want to get locked into it. If we're going to have a substantial portion of the work that can never be parallelized, maybe we've picked the wrong algorithm.

The work on making the logtape infrastructure parallel-aware seems very interesting and potentially useful for other things. Sadly, I don't have time to look at it right now.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Aug 3, 2016 at 11:42 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> I'm not going to say it's bad to be able to do things 2-2.5x faster,
> but linear scalability this ain't - particularly because your 2.58x
> faster case is using up to 7 or 8 times as much memory. The
> single-process case would be faster in that case, too: you could
> quicksort.

Certainly, there are cases where a parallel version could benefit from having more memory more so than from actually parallelizing the underlying task. However, this case was pointedly chosen to *not* be such a case. When maintenance_work_mem exceeds about 5GB, I've observed that since 9.6 increasing it is just as likely to hurt as to help by about +/-5% (unless and until it's all in memory, which still doesn't help much). In general, there isn't all that much point in doing a very large sort like this in memory. You just don't get that much of a benefit for the memory you use, because linearithmic CPU costs eventually really dominate linear sequential I/O costs.

I think you're focusing on the fact that there is a large absolute disparity in memory used in this one benchmark, but that isn't something that the gains shown particularly hinge upon. There isn't that much difference when workers must merge their own runs, for example. It saves the serial leader merge some work, and in particular makes it more cache efficient (by having fewer runs/tapes). Finally, while about 8x as much memory is used, the memory used over and above the serial case is almost all freed when the final merge begins (the final merges are therefore very similar in both cases, including in terms of memory use). So, for as long as you use 8x as much memory for 8 active processes, you get a 5.67x speed-up of that part alone. You still keep a few extra KiBs of memory for worker tapes and things like that during the leader's merge, but that's a close to negligible amount.

> I feel like for sorting, in particular, we probably ought
> to be setting the total memory budget, not the per-process memory
> budget. Or if not, then any CREATE INDEX benchmarking had better
> compare using scaled values for maintenance_work_mem; otherwise,
> you're measuring the impact of using more memory as much as anything
> else.

As I said, the benchmark was chosen to avoid that (and to be simple and reproducible). I am currently neutral on the question of whether or not maintenance_work_mem should be doled out per process or per sort operation. I do think that making it a per-process allowance is far closer to what we do for hash joins today, and is simpler. What's nice about the idea of making the workMem/maintenance_work_mem budget per sort is that it leaves the leader process with license to greatly increase the amount of memory it can use for the merge. Increasing the amount of memory used for the merge will improve things for longer than it will for workers. I've simulated it already.

> I also think that Amdahl's law is going to pinch pretty severely here.

Doesn't that almost always happen, though? Isn't that what you generally see with queries that show off the parallel join capability?

> If the final merge phase is a significant percentage of the total
> runtime, picking an algorithm that can't parallelize the final merge
> is going to limit the speedups to small multiples. That's an OK place
> to be as a result of not having done all the work yet, but you don't
> want to get locked into it.
> If we're going to have a substantial portion of the work that can
> never be parallelized, maybe we've picked the wrong algorithm.

I suggest that this work be compared to something with similar constraints. I used Google to try to get some indication of how much of a difference parallel CREATE INDEX makes in other major database systems. This is all I could find:

https://www.mssqltips.com/sqlservertip/3100/reduce-time-for-sql-server-index-rebuilds-and-update-statistics/

It seems like the degree of parallelism used for SQL Server tends to affect index build time in a way that is strikingly similar to what I've come up with (which may be a coincidence; I don't know anything about SQL Server). So, I suspect that the performance of this is fairly good in an apples-to-apples comparison.

Parallelizing merging can hurt or help, because there is a cost in memory bandwidth (if not I/O) for the extra passes that are used to keep more CPUs busy, which is kind of analogous to the situation with polyphase merge. I'm not saying that we shouldn't do that even still, but I believe that there are sharply diminishing returns. Tuple comparisons during merging are much more expensive than quicksort tuple comparisons, which tend to benefit from abbreviated keys a lot.

As I've said, there is probably a good argument to be made for partitioning to increase parallelism. But, that involves risks around the partitioning being driven by statistics or a cost model, and I don't think you'd be too on board with the idea of every CREATE INDEX after bulk loading needing an ANALYZE first. I tend to think of that as more of a parallel query thing, because you can often push down a lot more there, dynamic sampling might be possible, and there isn't a need to push all the tuples through one point in the end. Nothing I've done here precludes your idea of a sort-order-preserving gather node. I think that we may well need both. Since merging is a big bottleneck with this, we should probably also work to address that indirectly.

> The work on making the logtape infrastructure parallel-aware seems
> very interesting and potentially useful for other things. Sadly, I
> don't have time to look at it right now.

I would be happy to look at generalizing that further, to help parallel hash join. As you know, Thomas Munro and I have discussed this privately.

--
Peter Geoghegan
On Wed, Aug 3, 2016 at 5:13 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Wed, Aug 3, 2016 at 11:42 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I'm not going to say it's bad to be able to do things 2-2.5x faster,
>> but linear scalability this ain't - particularly because your 2.58x
>> faster case is using up to 7 or 8 times as much memory. The
>> single-process case would be faster in that case, too: you could
>> quicksort.
>
> [ lengthy counter-argument ]

None of this convinces me that testing this in a way that is not "apples to apples" is a good idea, nor will any other argument.

>> I also think that Amdahl's law is going to pinch pretty severely here.
>
> Doesn't that almost always happen, though?

To some extent, sure, absolutely. But it's our job as developers to try to foresee and minimize those cases. When Noah was at EnterpriseDB a few years ago and we were talking about parallel internal sort, Noah started by doing a survey of the literature and identified parallel quicksort as the algorithm that seemed best for our use case. Of course, every time quicksort partitions the input, you get two smaller sorting problems, so it's easy to see how to use 2 CPUs after the initial partitioning step has been completed and 4 CPUs after each of those partitions has been partitioned again, and so on. However, that turns out not to be good enough, because the first partitioning step can consume a significant percentage of the total runtime - so if you only start parallelizing after that, you're leaving too much on the table. To avoid that, the algorithm he was looking at had a (complicated) way of parallelizing the first partitioning step; then you can, it seems, do the full sort in parallel.

There are some somewhat outdated and perhaps naive ideas about this that we wrote up here:

https://wiki.postgresql.org/wiki/Parallel_Sort

Anyway, you're proposing an algorithm that can't be fully parallelized. Maybe that's OK. But I'm a little worried about it. I'd feel more confident if we knew that the merge could be done in parallel and were just leaving that to a later development stage; or if we picked an algorithm like the one above that doesn't leave a major chunk of the work unparallelizable.

> Isn't that what you
> generally see with queries that show off the parallel join capability?

For nested loop joins, no. The whole join operation can be done in parallel. For hash joins, yes: building the hash table once per worker can run afoul of Amdahl's law in a big way. That's why Thomas Munro is working on fixing it:

https://wiki.postgresql.org/wiki/EnterpriseDB_database_server_roadmap

Obviously, parallel query is subject to a long list of annoying restrictions at this point. On queries that don't hit any of those restrictions we can get 4-5x speedup with a leader and 4 workers. As we expand the range of plan types that we can construct, I think we'll see those kinds of speedups for a broader range of queries. (The question of exactly why we top out with as few workers as currently seems to be the case needs more investigation, too; maybe contention effects?)

>> If the final merge phase is a significant percentage of the total
>> runtime, picking an algorithm that can't parallelize the final merge
>> is going to limit the speedups to small multiples. That's an OK place
>> to be as a result of not having done all the work yet, but you don't
>> want to get locked into it.
>> If we're going to have a substantial portion of the work that can
>> never be parallelized, maybe we've picked the wrong algorithm.
>
> I suggest that this work be compared to something with similar
> constraints. I used Google to try to get some indication of how much
> of a difference parallel CREATE INDEX makes in other major database
> systems. This is all I could find:
>
> https://www.mssqltips.com/sqlservertip/3100/reduce-time-for-sql-server-index-rebuilds-and-update-statistics/

I do agree that it is important not to have unrealistic expectations.

> As I've said, there is probably a good argument to be made for
> partitioning to increase parallelism. But, that involves risks around
> the partitioning being driven by statistics or a cost model, and I
> don't think you'd be too on board with the idea of every CREATE INDEX
> after bulk loading needing an ANALYZE first. I tend to think of that
> as more of a parallel query thing, because you can often push down a
> lot more there, dynamic sampling might be possible, and there isn't a
> need to push all the tuples through one point in the end. Nothing
> I've done here precludes your idea of a sort-order-preserving gather
> node. I think that we may well need both.

Yes. Rushabh is working on that, and Finalize GroupAggregate -> Gather Merge -> Partial GroupAggregate -> Sort -> whatever is looking pretty sweet.

>> The work on making the logtape infrastructure parallel-aware seems
>> very interesting and potentially useful for other things. Sadly, I
>> don't have time to look at it right now.
>
> I would be happy to look at generalizing that further, to help
> parallel hash join. As you know, Thomas Munro and I have discussed
> this privately.

Right.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Aug 5, 2016 at 9:06 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> To some extent, sure, absolutely. But it's our job as developers to
> try to foresee and minimize those cases. When Noah was at
> EnterpriseDB a few years ago and we were talking about parallel
> internal sort, Noah started by doing a survey of the literature and
> identified parallel quicksort as the algorithm that seemed best for
> our use case. Of course, every time quicksort partitions the input,
> you get two smaller sorting problems, so it's easy to see how to use
> 2 CPUs after the initial partitioning step has been completed and 4
> CPUs after each of those partitions has been partitioned again, and
> so on. However, that turns out not to be good enough, because the
> first partitioning step can consume a significant percentage of the
> total runtime - so if you only start parallelizing after that, you're
> leaving too much on the table. To avoid that, the algorithm he was
> looking at had a (complicated) way of parallelizing the first
> partitioning step; then you can, it seems, do the full sort in
> parallel.
>
> There are some somewhat outdated and perhaps naive ideas about this
> that we wrote up here:
>
> https://wiki.postgresql.org/wiki/Parallel_Sort

I'm familiar with that effort. I think that when researching topics like sorting, it can sometimes be a mistake to not look at an approach specifically recommended by the database research community. A lot of the techniques we've benefited from within tuplesort.c have been a matter of addressing memory latency as a bottleneck; techniques that are fairly simple and not worth writing a general interest paper on. Also, things like abbreviated keys are beneficial in large part because people tend to follow the first normal form, and therefore an abbreviated key can contain a fair amount of entropy most of the time. Similarly, radix sort seems really cool, but our requirements around generality seem to make it impractical.

> Anyway, you're proposing an algorithm that can't be fully
> parallelized. Maybe that's OK. But I'm a little worried about it.
> I'd feel more confident if we knew that the merge could be done in
> parallel and were just leaving that to a later development stage; or
> if we picked an algorithm like the one above that doesn't leave a
> major chunk of the work unparallelizable.

I might be able to resurrect the parallel merge stuff, just to guide reviewer intuition on how much that can help or hurt. I can probably repurpose it to show you the mixed picture on how effective it is. I think it might help more with collatable text that doesn't have abbreviated keys, for example, because you can use more of the machine's memory bandwidth for longer. But for integers, it can hurt. (That's my recollection; I prototyped parallel merge a couple of months ago now.)

>> Isn't that what you
>> generally see with queries that show off the parallel join capability?
>
> For nested loop joins, no. The whole join operation can be done in
> parallel.

Sure, I know, but I'm suggesting that laws-of-physics problems may still be more significant than implementation deficiencies, even though those deficiencies need to be stamped out. Linear scalability is really quite rare for most database workloads.

> Obviously, parallel query is subject to a long list of annoying
> restrictions at this point. On queries that don't hit any of those
> restrictions we can get 4-5x speedup with a leader and 4 workers.
> As we expand the range of plan types that we can construct, I think
> we'll see those kinds of speedups for a broader range of queries.
> (The question of exactly why we top out with as few workers as
> currently seems to be the case needs more investigation, too; maybe
> contention effects?)

You're probably bottlenecked on memory bandwidth. Note that I showed improvements with 8 workers, not 4. 4 workers are slower than 8, but not by that much.

>> https://www.mssqltips.com/sqlservertip/3100/reduce-time-for-sql-server-index-rebuilds-and-update-statistics/
>
> I do agree that it is important not to have unrealistic expectations.

Great. My ambition for this patch is that it put parallel CREATE INDEX on a competitive footing against the implementations featured in other major systems. I don't think we need to do everything at once, but I have no intention of pushing forward with something that doesn't do respectably there. I also want to avoid partitioning in the first version of this, and probably in any version that backs CREATE INDEX. I've only made minimal changes to the tuplesort.h interface here to support parallelism. That flexibility counts for a lot, IMV.

>> As I've said, there is probably a good argument to be made for
>> partitioning to increase parallelism. But, that involves risks around
>> the partitioning being driven by statistics or a cost model
>
> Yes. Rushabh is working on that, and Finalize GroupAggregate ->
> Gather Merge -> Partial GroupAggregate -> Sort -> whatever is looking
> pretty sweet.

A "Gather Merge" node doesn't really sound like what I'm talking about. Isn't that something to do with table-level partitioning? I'm talking about dynamic partitioning, typically of a single table, of course.

>>> The work on making the logtape infrastructure parallel-aware seems
>>> very interesting and potentially useful for other things. Sadly, I
>>> don't have time to look at it right now.
>>
>> I would be happy to look at generalizing that further, to help
>> parallel hash join. As you know, Thomas Munro and I have discussed
>> this privately.
>
> Right.

By the way, the patch is in better shape from that perspective, as compared to the early version Thomas (CC'd) had access to. The BufFile stuff is now credible as a general-purpose abstraction.

--
Peter Geoghegan
On Sat, Aug 6, 2016 at 2:16 AM, Peter Geoghegan <pg@heroku.com> wrote:
> On Fri, Aug 5, 2016 at 9:06 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> There are some somewhat outdated and perhaps naive ideas about this
>> that we wrote up here:
>>
>> https://wiki.postgresql.org/wiki/Parallel_Sort
>
> I'm familiar with that effort. I think that when researching topics
> like sorting, it can sometimes be a mistake to not look at an approach
> specifically recommended by the database research community. A lot of
> the techniques we've benefited from within tuplesort.c have been a
> matter of addressing memory latency as a bottleneck; techniques that
> are fairly simple and not worth writing a general interest paper on.
> Also, things like abbreviated keys are beneficial in large part
> because people tend to follow the first normal form, and therefore an
> abbreviated key can contain a fair amount of entropy most of the time.
> Similarly, radix sort seems really cool, but our requirements around
> generality seem to make it impractical.
>
>> Anyway, you're proposing an algorithm that can't be fully
>> parallelized. Maybe that's OK. But I'm a little worried about it.
>> I'd feel more confident if we knew that the merge could be done in
>> parallel and were just leaving that to a later development stage; or
>> if we picked an algorithm like the one above that doesn't leave a
>> major chunk of the work unparallelizable.
>
> I might be able to resurrect the parallel merge stuff, just to guide
> reviewer intuition on how much that can help or hurt.

I think some of the factors here, like how many workers will be used for the merge phase, might impact performance. Having too many workers can lead to more communication cost, and having too few workers might not yield the best results for the merge. One thing I have noticed is that, in general for sorting, some of the other databases use range partitioning [1]; now, that might not be what is good for us. I see you mentioned above why it is not good [2], but I don't understand why you think it is a risky assumption to assume good partition boundaries for parallelizing sort.

[1] - https://docs.oracle.com/cd/E11882_01/server.112/e25523/parallel002.htm
Refer to the Producer or Consumer Operations section.
[2] - "That approach would not suit CREATE INDEX, because the approach's great strength is that the workers can run in parallel for the entire duration, since there is no merge bottleneck (this assumes good partition boundaries, which is a bit of a risky assumption)"

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Sat, Aug 6, 2016 at 6:46 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> I think some of the factors here, like how many workers will be used
> for the merge phase, might impact performance. Having too many
> workers can lead to more communication cost, and having too few
> workers might not yield the best results for the merge. One thing I
> have noticed is that, in general for sorting, some of the other
> databases use range partitioning [1]; now, that might not be what is
> good for us.

I don't disagree with anything you say here. I acknowledged that partitioning will probably be important for sorting in my introductory e-mail, after all.

> I see you mentioned above why it is not good [2], but I don't
> understand why you think it is a risky assumption to assume good
> partition boundaries for parallelizing sort.

Well, apparently there are numerous problems with partitioning in systems like SQL Server and Oracle in the worst case. For one thing, in the event of a misestimation (or failure of the dynamic sampling that I presume can sometimes be used), workers can be completely starved of work for the entire duration of the sort. And for CREATE INDEX to get much of any benefit, all workers must write their part of the index independently, too. This can affect the physical structure of the final index. SQL Server also has a caveat in its documentation about this resulting in an unbalanced final index, which I imagine could be quite bad in the worst case.

I believe that it's going to be hard to get any version of this that writes the index simultaneously in each worker accepted, for these reasons. This patch I came up with isn't very different from the serial case at all. Any index built in parallel by the patch ought to have relfilenode files on the filesystem that are 100% identical to those produced by the serial case, in fact (since CREATE INDEX does not set LSNs in the new index pages). I've actually developed a simple way of "fingerprinting" indexes during testing of this patch, knowing that hashing the files on disk ought to produce a perfect match compared to a master branch serial sort case.

At the same time, any information that I've seen about how much parallel CREATE INDEX speeds things up in these other systems indicates that the benefits are very similar. It tends to be in the 2x - 3x range, with the same reduction in throughput seen at about 16 workers, after we peak at about 8 workers. So, I think that the benefits of partitioning are not really seen with CREATE INDEX (I think of partitioning as more of a parallel query thing). Obviously, any benefit that might still exist for CREATE INDEX in particular, when weighed against the costs, makes partitioning look pretty unattractive as a next step.

I think that during the merge phase of parallel CREATE INDEX as implemented, the system generally still isn't that far from being I/O bound. Whereas, with parallel query, partitioning makes each worker able to return one tuple from its own separate range very quickly -- not just one process. (Presumably, each worker merges non-overlapping "ranges" from runs initially sorted in each worker; each worker subsequently merges after a partition-wise redistribution of the initial fully sorted runs, allowing for dynamic sampling to optimize the actual range used for load balancing.) The workers can then do more CPU-bound processing in whatever node is fed by each worker's ranged merge; everything is kept busy. That's the approach that I personally had in mind for partitioning, at least.
It's really nice for parallel query to be able to totally separate workers after the point of redistribution. CREATE INDEX is not far from being I/O bound anyway, though, so it benefits far less. (Consider how fast the merge phase still is at writing out the index in *absolute* terms.)

Look at figure 9 in this paper:

http://www.vldb.org/pvldb/vol7/p85-balkesen.pdf

Even in good cases for "independent sorting", there is only a benefit seen at 8 cores. At the same time, I can only get about 6x scaling with 8 workers, just for the initial generation of runs.

All of these factors are why I believe I'm able to compete well with other systems with this relatively straightforward, evolutionary approach. I have a completely open mind about partitioning, but my approach makes sense in this context.

--
Peter Geoghegan
On Wed, Aug 3, 2016 at 2:13 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Since merging is a big bottleneck with this, we should probably also
> work to address that indirectly.

I attach a patch that changes how we maintain the heap invariant during tuplesort merging. I already mentioned this over on the "Parallel tuplesort, partitioning, merging, and the future" thread. As noted already on that thread, this patch makes merging clustered numeric input about 2.1x faster overall in one case, which is particularly useful in the context of a serial final/leader merge during a parallel CREATE INDEX. Even *random* non-C-collated text input is made significantly faster. This work is totally orthogonal to parallelism, though; it's just very timely, given our discussion of the merge bottleneck on this thread.

If I benchmark a parallel build of a 100 million row index, with presorted input, I can see a 71% reduction in *comparisons* with 8 tapes/workers, and an 80% reduction in comparisons with 16 workers/tapes in one instance (the numeric case I just mentioned). With random input, we can still come out significantly ahead, but not to the same degree. I was able to see a reduction in comparisons during a leader merge, from 1,468,522,397 comparisons to 999,755,569 comparisons, which is obviously still quite significant (worker merges, if any, benefit too). I think I need to redo my parallel CREATE INDEX benchmark, so that you can take this into account. Also, I think that this patch will make very large external sorts that naturally have tens of runs to merge significantly faster, but I didn't bother to benchmark that.

The patch is intended to be applied on top of parallel B-Tree patches 0001-* and 0002-* [1]. I happened to test it with parallelism, but these are all independently useful, and will be entered as a separate CF entry (perhaps better to commit the earlier two patches first, to avoid merge conflicts). I'm optimistic that we can get those 3 patches in the series out of the way early, without blocking on discussing parallel sort.

The patch makes tuplesort merging shift down and displace the root tuple with the tape's next preread tuple, rather than compacting and then inserting into the heap anew. This approach to maintaining the heap as tuples are returned to the caller will always produce fewer comparisons overall. The new approach is also simpler. We were already shifting down to compact the heap within the misleadingly named [2] function tuplesort_heap_siftup() -- why not instead just use the caller tuple (the tuple that we currently go on to insert) when initially shifting down (not the heap's preexisting last tuple, which is guaranteed to go straight to the leaf level again)? That way, we don't need to enlarge the heap at all through insertion, shifting up, etc. We're done, and are *guaranteed* to have performed less work (fewer comparisons and swaps) than with the existing approach (this is the reason for my optimism about getting this stuff out of the way early).

This new approach is more or less the *conventional* way to maintain the heap invariant when returning elements from a heap during k-way merging. Our existing approach is convoluted; merging was presumably only coded that way because the generic functions tuplesort_heap_siftup() and tuplesort_heap_insert() happened to be available. Perhaps the problem was masked by unrelated bottlenecks that existed at the time, too.
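To illustrate the shape of the thing (and only that -- this is a toy standalone version using a plain int min-heap, not the patch's SortTuple representation or its actual code), root displacement is just the textbook "replace the root, then sift down" routine:

#include <stdio.h>

/*
 * Toy analog of the root displace operation: heap[0..n-1] is a valid
 * min-heap on entry.  The caller has just consumed the root (the
 * merge's next output tuple); newval is the displacing value (the
 * tape's next preread tuple, in the real patch).  Sift newval down in
 * a single top-down pass: at most two comparisons per level, and no
 * separate compact-then-insert.
 */
static void
heap_root_displace(int *heap, int n, int newval)
{
    int     i = 0;

    for (;;)
    {
        int     left = 2 * i + 1;
        int     right = left + 1;
        int     imin = left;

        if (left >= n)
            break;              /* reached a leaf; newval lands at i */
        if (right < n && heap[right] < heap[left])
            imin = right;       /* comparison 1: pick the smaller child */
        if (newval <= heap[imin])
            break;              /* comparison 2: invariant restored */
        heap[i] = heap[imin];   /* promote smaller child into the hole */
        i = imin;
    }
    heap[i] = newval;
}

int
main(void)
{
    int     heap[] = {1, 3, 2, 7, 4, 5, 6};     /* valid min-heap */
    int     i;

    /* Root value 1 was just returned to the caller; 9 displaces it */
    heap_root_displace(heap, 7, 9);
    for (i = 0; i < 7; i++)
        printf("%d ", heap[i]);                 /* prints: 2 3 5 7 4 9 6 */
    printf("\n");
    return 0;
}

The old approach (compact the heap by sifting the last element down from the root, then insert the caller tuple by shifting it up from a new leaf) does strictly more comparisons and swaps to arrive at an equivalent state.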
I think that I could push this further (a minimum of 2 comparisons per item returned when 3 or more tapes are active still seems like 1 comparison too many), but what I have here gets us most of the benefit. And, it does so while not actually adding code that could be called "overly clever", IMV. I'll probably leave clever, aggressive optimization of merging for a later release.

[1] https://www.postgresql.org/message-id/CAM3SWZQKM=Pzc=CAHzRixKjp2eO5Q0Jg1SoFQqeXFQ647JiwqQ@mail.gmail.com
[2] https://www.postgresql.org/message-id/CAM3SWZQ+2gJMNV7ChxwEXqXopLfb_FEW2RfEXHJ+GsYF39f6MQ@mail.gmail.com

--
Peter Geoghegan
On Mon, Aug 1, 2016 at 3:18 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Attached WIP patch series:

This has bitrot, since commit da1c9163 changed the interface for checking parallel safety. I'll have to fix that, and will probably take the opportunity to change how workers have maintenance_work_mem apportioned while I'm at it. To recap, it would probably be better if maintenance_work_mem remained a high watermark for the entire CREATE INDEX, rather than applying as a per-worker allowance.

--
Peter Geoghegan
On 08/16/2016 03:33 AM, Peter Geoghegan wrote:
> I attach a patch that changes how we maintain the heap invariant
> during tuplesort merging. I already mentioned this over on the
> "Parallel tuplesort, partitioning, merging, and the future" thread.
> As noted already on that thread, this patch makes merging clustered
> numeric input about 2.1x faster overall in one case, which is
> particularly useful in the context of a serial final/leader merge
> during a parallel CREATE INDEX. Even *random* non-C-collated text
> input is made significantly faster. This work is totally orthogonal
> to parallelism, though; it's just very timely, given our discussion
> of the merge bottleneck on this thread.

Nice!

> The patch makes tuplesort merging shift down and displace the root
> tuple with the tape's next preread tuple, rather than compacting and
> then inserting into the heap anew. This approach to maintaining the
> heap as tuples are returned to the caller will always produce fewer
> comparisons overall. The new approach is also simpler. We were
> already shifting down to compact the heap within the misleadingly
> named [2] function tuplesort_heap_siftup() -- why not instead just
> use the caller tuple (the tuple that we currently go on to insert)
> when initially shifting down (not the heap's preexisting last tuple,
> which is guaranteed to go straight to the leaf level again)? That
> way, we don't need to enlarge the heap at all through insertion,
> shifting up, etc. We're done, and are *guaranteed* to have performed
> less work (fewer comparisons and swaps) than with the existing
> approach (this is the reason for my optimism about getting this
> stuff out of the way early).

Makes sense.

> This new approach is more or less the *conventional* way to maintain
> the heap invariant when returning elements from a heap during k-way
> merging. Our existing approach is convoluted; merging was presumably
> only coded that way because the generic functions
> tuplesort_heap_siftup() and tuplesort_heap_insert() happened to be
> available. Perhaps the problem was masked by unrelated bottlenecks
> that existed at the time, too.

Yeah, this seems like a very obvious optimization. Is there a standard name for this technique in the literature? I'm OK with "displace", or perhaps just "replace" or "siftup+insert", but if there's a standard name for this, let's use that.

- Heikki
I'm reviewing patches 1-3 in this series, i.e. those patches that are not directly related to parallelism, but are independent improvements to merging.

Let's begin with patch 1:

On 08/02/2016 01:18 AM, Peter Geoghegan wrote:
> Cap the number of tapes used by external sorts
>
> Commit df700e6b set merge order based on available buffer space (the
> number of tapes was as high as possible while still allowing at
> least 32 * BLCKSZ buffer space per tape), rejecting Knuth's
> theoretically justified "sweet spot" of 7 tapes (a merge order of 6
> -- Knuth's P), improving performance when the sort thereby completed
> in one pass. However, it's still true that there are unlikely to be
> benefits from increasing the number of tapes past 7 once the amount
> of data to be sorted significantly exceeds available memory; that
> commit probably mostly just improved matters where it enabled all
> merging to be done in a final on-the-fly merge.
>
> One problem with the merge order logic established by that commit is
> that with large work_mem settings and data volumes, the tapes
> previously wasted as much as 8% of the available memory budget; tens
> of thousands of tapes could be logically allocated for a sort that
> will only benefit from a few dozen.

Yeah, wasting 8% of the memory budget on this seems like a bad idea. If I understand correctly, that makes the runs shorter than necessary, leading to more runs.

> A new quasi-arbitrary cap of 501 is applied on the number of tapes
> that tuplesort will ever use (i.e. merge order is capped at 500
> inclusive). This is a conservative estimate of the number of runs at
> which doing all merging on-the-fly no longer allows greater
> overlapping of I/O and computation.

Hmm. Surely there are cases where you could do it in one merge pass with more than 501 tapes, but now you need two passes? And that would hurt performance, no?

Why do we reserve the buffer space for all the tapes right at the beginning? Instead of the single USEMEM(maxTapes * TAPE_BUFFER_OVERHEAD) call in inittapes(), couldn't we call USEMEM(TAPE_BUFFER_OVERHEAD) every time we start a new run, until we reach maxTapes?

- Heikki
On Tue, Sep 6, 2016 at 12:08 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> I attach a patch that changes how we maintain the heap invariant
>> during tuplesort merging.
>
> Nice!

Thanks!

>> This new approach is more or less the *conventional* way to maintain
>> the heap invariant when returning elements from a heap during k-way
>> merging. Our existing approach is convoluted; merging was presumably
>> only coded that way because the generic functions
>> tuplesort_heap_siftup() and tuplesort_heap_insert() happened to be
>> available. Perhaps the problem was masked by unrelated bottlenecks
>> that existed at the time, too.
>
> Yeah, this seems like a very obvious optimization. Is there a
> standard name for this technique in the literature? I'm OK with
> "displace", or perhaps just "replace" or "siftup+insert", but if
> there's a standard name for this, let's use that.

I used the term "displace" specifically because it wasn't a term with a well-defined meaning in the context of the analysis of algorithms. Just like "insert" isn't for tuplesort_heap_insert(). I'm not particularly attached to the name tuplesort_heap_root_displace(), but I do think that whatever it ends up being called should at least not be named after an implementation detail. For example, tuplesort_heap_root_replace() also seems fine.

I think that tuplesort_heap_siftup() should be called something like tuplesort_heap_compact instead [1], since what it actually does (shifting down -- the existing name is completely backwards!) is just an implementation detail involved in compacting the heap (notice that it decrements memtupcount, which, by now, means the k-way merge heap gets one element smaller). I can write a patch to do this renaming, if you're interested. Someone should fix it, because independent of all this, it's just wrong.

[1] https://www.postgresql.org/message-id/CAM3SWZQKM=Pzc=CAHzRixKjp2eO5Q0Jg1SoFQqeXFQ647JiwqQ@mail.gmail.com

--
Peter Geoghegan
On Tue, Sep 6, 2016 at 12:39 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Tue, Sep 6, 2016 at 12:08 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>> I attach a patch that changes how we maintain the heap invariant
>>> during tuplesort merging.
>>
>> Nice!
>
> Thanks!

BTW, the way that k-way merging is made more efficient by this approach makes the case for replacement selection even weaker than it was just before we almost killed it. I hate to say it, but I have to wonder if we shouldn't get rid of the new-to-9.6 replacement_sort_tuples because of this, and completely kill replacement selection. I'm not going to go on about it, but that seems sensible to me.

--
Peter Geoghegan
On Mon, Aug 15, 2016 at 9:33 PM, Peter Geoghegan <pg@heroku.com> wrote:
> The patch is intended to be applied on top of parallel B-Tree patches
> 0001-* and 0002-* [1]. I happened to test it with parallelism, but
> these are all independently useful, and will be entered as a separate
> CF entry (perhaps better to commit the earlier two patches first, to
> avoid merge conflicts). I'm optimistic that we can get those 3
> patches in the series out of the way early, without blocking on
> discussing parallel sort.

Applied patches 1 and 2, builds fine, regression tests run fine. It was a prerequisite to reviewing patch 3 (which I'm going to do below), so I thought I might as well report on that tidbit of info, fwiw.

> The patch makes tuplesort merging shift down and displace the root
> tuple with the tape's next preread tuple, rather than compacting and
> then inserting into the heap anew. This approach to maintaining the
> heap as tuples are returned to the caller will always produce fewer
> comparisons overall. The new approach is also simpler. We were
> already shifting down to compact the heap within the misleadingly
> named [2] function tuplesort_heap_siftup() -- why not instead just
> use the caller tuple (the tuple that we currently go on to insert)
> when initially shifting down (not the heap's preexisting last tuple,
> which is guaranteed to go straight to the leaf level again)? That
> way, we don't need to enlarge the heap at all through insertion,
> shifting up, etc. We're done, and are *guaranteed* to have performed
> less work (fewer comparisons and swaps) than with the existing
> approach (this is the reason for my optimism about getting this
> stuff out of the way early).

Patch 3 applies fine to git master as of 25794e841e5b86a0f90fac7f7f851e5d950e51e2 (on top of patches 1 and 2). Builds fine and without warnings on gcc 4.8.5 AFAICT, and the regression test suite runs without issues as well.

Patch lacks any new tests, but the changed code paths seem covered sufficiently by existing tests. A little bit of fuzzing on the patch itself, like reverting some key changes, or flipping some key comparisons, induces test failures as it should, mostly in cluster.

The logic in tuplesort_heap_root_displace seems sound, except:

+        */
+       memtuples[i] = memtuples[imin];
+       i = imin;
+   }
+
+   Assert(state->memtupcount > 1 || imin == 0);
+   memtuples[imin] = *newtup;
+}

Why that assert? Wouldn't it make more sense to Assert(imin < n)?

In the meantime, I'll go and do some perf testing. Assuming the speedup is realized during testing, LGTM.
On Tue, Sep 6, 2016 at 12:34 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote: > I'm reviewing patches 1-3 in this series, i.e. those patches that are not > directly related to parallelism, but are independent improvements to > merging.

That's fantastic! Thanks! I'm really glad you're picking those ones up. I feel that I'm far too dependent on Robert's review for this stuff. That shouldn't be taken as a statement against Robert -- it's intended as quite the opposite -- but it's just personally difficult to rely on exactly one other person for something that I've put so much work into. Robert has been involved with 100% of all sorting patches I've written, generally with far less input from anyone else, and at this point, that's really rather a lot of complex patches.

> Let's begin with patch 1: > > On 08/02/2016 01:18 AM, Peter Geoghegan wrote: >> >> Cap the number of tapes used by external sorts > Yeah, wasting 8% of the memory budget on this seems like a bad idea. If I > understand correctly, that makes the runs shorter than necessary, leading to > more runs.

Right. Quite simply, whatever you could have used the workMem for prior to the merge step, now you can't. It's not so bad during the merge step of a final on-the-fly merge (or, with the 0002-* patch, any final merge), since you can get a "refund" of unused (though logically allocated by USEMEM()) tapes to grow memtuples with (other overhead forms the majority of the refund, though). That still isn't much consolation to the user, because run generation is typically much more expensive (we really just refund unused tapes because it's easy).

>> A new quasi-arbitrary cap of 501 is applied on the number of tapes that >> tuplesort will ever use (i.e. merge order is capped at 500 inclusive). >> This is a conservative estimate of the number of runs at which doing all >> merging on-the-fly no longer allows greater overlapping of I/O and >> computation. > > > Hmm. Surely there are cases, so that with > 501 tapes you could do it with > one merge pass, but now you need two? And that would hurt performance, no?

In theory, yes, that could be true, and not just for my proposed new cap of 500 for merge order (501 tapes), but for any such cap. I noticed that the Greenplum tuplesort.c uses a max of 250, so I guess I just thought to double that. Way back in 2006, Tom and Simon talked about a cap too on several occasions, but I think that that was in the thousands then. Hundreds of runs are typically quite rare. It isn't that painful to do a second pass, because the resulting merge process may be more CPU cache efficient, and CPU cost tends to be the dominant cost these days (over and above the extra I/O that an extra pass requires).

This seems like a very familiar situation to me: I pick a quasi-arbitrary limit or cap for something, and it's not clear that it's optimal. Everyone more or less recognizes the need for such a cap, but is uncomfortable about the exact figure chosen, not because it's objectively bad, but because it's clearly something pulled from the air, to some degree. It may not make you feel much better about it, but I should point out that I've read a paper that claims "Modern servers of the day have hundreds of GB operating memory and tens of TB storage capacity. Hence, if the sorted data fit the persistent storage, the first phase will generate hundreds of runs at most." [1].

Feel free to make a counter-proposal for a cap. I'm not attached to 500. I'm mostly worried about blatant waste with very large workMem sizings.
Tens of thousands of tapes is just crazy. The amount of data that you need to have as input is very large when workMem is big enough for this new cap to be enforced.

> Why do we reserve the buffer space for all the tapes right at the beginning? > Instead of the single USEMEM(maxTapes * TAPE_BUFFER_OVERHEAD) call in > inittapes(), couldn't we call USEMEM(TAPE_BUFFER_OVERHEAD) every time we > start a new run, until we reach maxTapes?

No, because then you have no way to clamp back memory, which is now almost all used (we hold off from making LACKMEM() continually true, if at all possible, which is almost always the case). You can't really continually shrink memtuples to make space for new tapes, which is what it would take.

[1] http://ceur-ws.org/Vol-1343/paper8.pdf -- Peter Geoghegan
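To put rough numbers on the waste being discussed, here is a back-of-the-envelope calculation. It assumes the 9.6-era constants TAPE_BUFFER_OVERHEAD = 3 * BLCKSZ (24KB) and MERGE_BUFFER_SIZE = 32 * BLCKSZ (256KB), so treat the arithmetic as illustrative rather than exact accounting:

/*
 * Rough accounting behind the "8% of the memory budget" figure, for
 * maintenance_work_mem = 5GB (assumed 9.6-era constants):
 *
 *   maxTapes ~ workMem / (MERGE_BUFFER_SIZE + TAPE_BUFFER_OVERHEAD)
 *            ~ 5GB / 280KB
 *            ~ 18,700 tapes
 *
 *   up-front USEMEM() for tape buffers:
 *     ~18,700 * 24KB ~ 440MB, i.e. roughly 8-9% of the 5GB budget,
 *
 * even when only a handful of runs are ever written.  Under the
 * proposed cap, 501 tapes reserve only ~12MB up front.
 */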
On Tue, Sep 6, 2016 at 2:46 PM, Peter Geoghegan <pg@heroku.com> wrote: > Feel free to make a counter-proposal for a cap. I'm not attached to > 500. I'm mostly worried about blatant waste with very large workMem > sizings. Tens of thousands of tapes is just crazy. The amount of data > that you need to have as input is very large when workMem is big > enough for this new cap to be enforced. If tuplesort callers passed a hint about the number of tuples that would ultimately be sorted, and (for the sake of argument) it was magically 100% accurate, then theoretically we could just allocate the right number of tapes up-front. That discussion is a big can of worms, though. There are of course obvious disadvantages that come with a localized cost model, even if you're prepared to add some "slop" to the allocation size or whatever. -- Peter Geoghegan
On Tue, Sep 6, 2016 at 12:57 PM, Claudio Freire <klaussfreire@gmail.com> wrote: > Patch lacks any new tests, but the changed code paths seem covered > sufficiently by existing tests. A little bit of fuzzing on the patch > itself, like reverting some key changes, or flipping some key > comparisons, induces test failures as it should, mostly in cluster. > > The logic in tuplesort_heap_root_displace seems sound, except: > > + */ > + memtuples[i] = memtuples[imin]; > + i = imin; > + } > + > + Assert(state->memtupcount > 1 || imin == 0); > + memtuples[imin] = *newtup; > +} > > Why that assert? Wouldn't it make more sense to Assert(imin < n) ? There might only be one or two elements in the heap. Note that the heap size is indicated by state->memtupcount at this point in the sort, which is a little confusing (that differs from how memtupcount is used elsewhere, where we don't partition memtuples into a heap portion and a preread tuples portion, as we do here). > In the meanwhile, I'll go and do some perf testing. > > Assuming the speedup is realized during testing, LGTM. Thanks. I suggest spending at least as much time on unsympathetic cases (e.g., only 2 or 3 tapes must be merged). At the same time, I suggest focusing on a type that has relatively expensive comparisons, such as collated text, to make differences clearer. -- Peter Geoghegan
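In other words, something like the following layout holds during merging (a simplified sketch of the 9.6-era arrangement, not authoritative; the exact bookkeeping lives in tuplesort.c):

/*
 * Rough layout of memtuples[] during a merge.  Unlike in other phases,
 * state->memtupcount counts only the heap portion here:
 *
 *   memtuples[0]                    root of the merge heap: the next
 *                                   tuple to return to caller
 *   memtuples[1 .. memtupcount-1]   rest of the heap, roughly one
 *                                   entry per input tape still active
 *   memtuples[memtupcount .. N-1]   slots for preread tuples batched
 *                                   in from the tapes
 */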
On Tue, Sep 6, 2016 at 8:28 PM, Peter Geoghegan <pg@heroku.com> wrote: > On Tue, Sep 6, 2016 at 12:57 PM, Claudio Freire <klaussfreire@gmail.com> wrote: >> Patch lacks any new tests, but the changed code paths seem covered >> sufficiently by existing tests. A little bit of fuzzing on the patch >> itself, like reverting some key changes, or flipping some key >> comparisons, induces test failures as it should, mostly in cluster. >> >> The logic in tuplesort_heap_root_displace seems sound, except: >> >> + */ >> + memtuples[i] = memtuples[imin]; >> + i = imin; >> + } >> + >> + Assert(state->memtupcount > 1 || imin == 0); >> + memtuples[imin] = *newtup; >> +} >> >> Why that assert? Wouldn't it make more sense to Assert(imin < n) ? > > There might only be one or two elements in the heap. Note that the > heap size is indicated by state->memtupcount at this point in the > sort, which is a little confusing (that differs from how memtupcount > is used elsewhere, where we don't partition memtuples into a heap > portion and a preread tuples portion, as we do here).

I noticed, but here n = state->memtupcount:

+ Assert(memtuples[0].tupindex == newtup->tupindex);
+
+ CHECK_FOR_INTERRUPTS();
+
+ n = state->memtupcount; /* n is heap's size, including old root */
+ imin = 0; /* start with caller's "hole" in root */
+ i = imin;

In fact, the assert on the patch would allow writing memtuples outside the heap, as in calling tuplesort_heap_root_displace if memtupcount==0, but I don't think that should be legal (memtuples[0] == memtuples[imin] would be outside the heap).

Sure, that's a weird enough case (that assert up there already reads memtuples[0] which would be equally illegal if memtupcount==0), but it goes on to show that the assert expression just seems odd for its intent.

BTW, I know it's not the scope of the patch, but shouldn't root_displace be usable on the TSS_BOUNDED phase?
On Tue, Sep 6, 2016 at 4:55 PM, Claudio Freire <klaussfreire@gmail.com> wrote: > I noticed, but here n = state->memtupcount > > + Assert(memtuples[0].tupindex == newtup->tupindex); > + > + CHECK_FOR_INTERRUPTS(); > + > + n = state->memtupcount; /* n is heap's size, > including old root */ > + imin = 0; /* > start with caller's "hole" in root */ > + i = imin;

I'm fine with using "n" in the later assertion you mentioned, if that's clearer to you. memtupcount is broken out as "n" simply because that's less verbose, in a place where that makes things far clearer.

> In fact, the assert on the patch would allow writing memtuples outside > the heap, as in calling tuplesort_heap_root_displace if > memtupcount==0, but I don't think that should be legal (memtuples[0] > == memtuples[imin] would be outside the heap).

You have to have a valid heap (i.e. there must be at least one element) to call tuplesort_heap_root_displace(), and it doesn't directly compact the heap, so it must remain valid on return. The assertion exists to make sure that everything is okay with a one-element heap, a case which is quite possible. If you want to see a merge involving one input tape, apply the entire parallel CREATE INDEX patch set, set "force_parallel_mode = regress", and note that the leader merges only 1 input tape, making the heap only ever contain one element. In general, most use of the heap for k-way merging will eventually end up as a one element heap, at the very end. Maybe that assertion you mention is overkill, but I like to err on the side of overkill with assertions. It doesn't seem that important, though.

> Sure, that's a weird enough case (that assert up there already reads > memtuples[0] which would be equally illegal if memtupcount==0), but it > goes on to show that the assert expression just seems odd for its > intent. > > BTW, I know it's not the scope of the patch, but shouldn't > root_displace be usable on the TSS_BOUNDED phase?

I don't think it should be, no. With a top-n heap sort, the expectation is that after a little while, we can immediately determine that most tuples do not belong in the heap (more than one comparison per tuple is only required when the tuple actually does go into the heap, which should be fairly rare after a time). That's why that general strategy can be so much faster, of course. Note that that heap is "reversed" -- the sort order is inverted, so that we can use a minheap. The top of the heap is the most marginal tuple in the top-n heap so far, and so is the next to be removed from consideration entirely (not the next to be returned to caller, when merging). Anyway, I just don't think that this is important enough to change -- it couldn't possibly be worth much of any risk. I can see the appeal of consistency, but I also see the appeal of sticking to how things work there: continually and explicitly inserting into and compacting the heap seems like a good enough way of framing what a top-n heap does, since there are no groupings of tuples (tapes) involved there. -- Peter Geoghegan
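For reference, the bounded-sort path at issue looks roughly like this. This is reconstructed from memory of 9.6's tuplesort.c and lightly simplified, so check it against the tree rather than taking it as verbatim:

/*
 * TSS_BOUNDED (top-n) case, sketched.  The bounded heap uses a
 * reversed comparator, so memtuples[0] is the most marginal tuple
 * retained so far; most input tuples lose to it after a single
 * comparison and are discarded immediately.
 */
if (COMPARETUP(state, tuple, &state->memtuples[0]) <= 0)
{
    /* not better than the current most-marginal tuple: discard */
    free_sort_tuple(state, tuple);
}
else
{
    /* discard the old root, then compact and insert -- the
     * siftup+insert sequence Claudio suggests replacing with
     * root_displace */
    free_sort_tuple(state, &state->memtuples[0]);
    tuplesort_heap_siftup(state, false);
    tuplesort_heap_insert(state, tuple, 0, false);
}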
On Tue, Sep 6, 2016 at 9:19 PM, Peter Geoghegan <pg@heroku.com> wrote: > On Tue, Sep 6, 2016 at 4:55 PM, Claudio Freire <klaussfreire@gmail.com> wrote: >> I noticed, but here n = state->memtupcount >> >> + Assert(memtuples[0].tupindex == newtup->tupindex); >> + >> + CHECK_FOR_INTERRUPTS(); >> + >> + n = state->memtupcount; /* n is heap's size, >> including old root */ >> + imin = 0; /* >> start with caller's "hole" in root */ >> + i = imin; > > I'm fine with using "n" in the later assertion you mentioned, if > that's clearer to you. memtupcount is broken out as "n" simply because > that's less verbose, in a place where that makes things far clearer. > >> In fact, the assert on the patch would allow writing memtuples outside >> the heap, as in calling tuplesort_heap_root_displace if >> memtupcount==0, but I don't think that should be legal (memtuples[0] >> == memtuples[imin] would be outside the heap). > > You have to have a valid heap (i.e. there must be at least one > element) to call tuplesort_heap_root_displace(), and it doesn't > directly compact the heap, so it must remain valid on return. The > assertion exists to make sure that everything is okay with a > one-element heap, a case which is quite possible.

More than using "n" or "memtupcount" what I'm saying is to assert that memtuples[imin] is inside the heap, which would catch the same errors the original assert would, and more.

Assert(imin < state->memtupcount)

If you prefer. The original assert allows any value of imin for memtupcount>1, and that's my main concern. It shouldn't.

On Tue, Sep 6, 2016 at 9:19 PM, Peter Geoghegan <pg@heroku.com> wrote: >> Sure, that's a weird enough case (that assert up there already reads >> memtuples[0] which would be equally illegal if memtupcount==0), but it >> goes on to show that the assert expression just seems odd for its >> intent. >> >> BTW, I know it's not the scope of the patch, but shouldn't >> root_displace be usable on the TSS_BOUNDED phase? > > I don't think it should be, no. With a top-n heap sort, the > expectation is that after a little while, we can immediately determine > that most tuples do not belong in the heap (more than one comparison > per tuple is only required when the tuple actually does go into the > heap, which should be fairly rare after a time). That's why that > general strategy can be so much faster, of course.

I wasn't proposing getting rid of that optimization, but just replacing the siftup+insert step with root_displace...

> Note that that heap is "reversed" -- the sort order is inverted, so > that we can use a minheap. The top of the heap is the most marginal > tuple in the top-n heap so far, and so is the next to be removed from > consideration entirely (not the next to be returned to caller, when > merging).

...but I didn't pause to consider that point. It still looks like a valid optimization: instead of rearranging the heap twice (siftup + insert), do it once (replace + relocate). However, I agree that it's not worth the risk of conflating the two optimizations. That one can be done later as a separate patch.
On Tue, Sep 6, 2016 at 5:50 PM, Claudio Freire <klaussfreire@gmail.com> wrote: > However, I agree that it's not worth the risk conflating the two > optimizations. That one can be done later as a separate patch. I'm rather fond of the assertions about tape number that exist within root_displace currently. But, yeah, maybe. -- Peter Geoghegan
On 09/06/2016 10:42 PM, Peter Geoghegan wrote: > On Tue, Sep 6, 2016 at 12:39 PM, Peter Geoghegan <pg@heroku.com> wrote: >> On Tue, Sep 6, 2016 at 12:08 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote: >>>> I attach a patch that changes how we maintain the heap invariant >>>> during tuplesort merging. >> >>> Nice! >> >> Thanks! > > BTW, the way that k-way merging is made more efficient by this > approach makes the case for replacement selection even weaker than it > was just before we almost killed it. This also makes the replacement selection cheaper, no? > I hate to say it, but I have to > wonder if we shouldn't get rid of the new-to-9.6 > replacement_sort_tuples because of this, and completely kill > replacement selection. I'm not going to go on about it, but that seems > sensible to me. Yeah, perhaps. But that's a different story. - Heikki
On Tue, Sep 6, 2016 at 10:28 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote: >> BTW, the way that k-way merging is made more efficient by this >> approach makes the case for replacement selection even weaker than it >> was just before we almost killed it. > > > This also makes the replacement selection cheaper, no?

Well, maybe, but the whole idea behind replacement_sort_tuples (by which I mean the continued occasional use of replacement selection by Postgres) was that we hope to avoid a merge step *entirely*. This new merge shift down heap patch could make the merge step so cheap as to be next to free anyway (in the event of presorted input), so the argument for replacement_sort_tuples is weakened further. It might always be cheaper once you factor in that the TSS_SORTEDONTAPE path for returning tuples to caller happens to not be able to use batch memory, even with something like collated text. And, as a bonus, you get something that works just as well with an inverse correlation, which was traditionally the worst case for replacement selection (it makes it produce runs no larger than those produced by quicksort). Anyway, I only mention this because it occurs to me. I have no desire to go back to talking about replacement selection either. Maybe it's useful to point this out, because it makes it clearer still that severely limiting the use of replacement selection in 9.6 was totally justified. -- Peter Geoghegan
On Tue, Sep 6, 2016 at 10:36 PM, Peter Geoghegan <pg@heroku.com> wrote: > Well, maybe, but the whole idea behind replacement_sort_tuples (by > which I mean the continued occasional use of replacement selection by > Postgres) was that we hope to avoid a merge step *entirely*. This new > merge shift down heap patch could make the merge step so cheap as to > be next to free anyway (in the event of presorted input)

I mean: Cheaper than just processing the tuples to return to caller without comparisons/merging (within the TSS_SORTEDONTAPE path). I do not mean free in an absolute sense, of course. -- Peter Geoghegan
On 09/07/2016 12:46 AM, Peter Geoghegan wrote: > On Tue, Sep 6, 2016 at 12:34 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote: >> Why do we reserve the buffer space for all the tapes right at the beginning? >> Instead of the single USEMEM(maxTapes * TAPE_BUFFER_OVERHEAD) call in >> inittapes(), couldn't we call USEMEM(TAPE_BUFFER_OVERHEAD) every time we >> start a new run, until we reach maxTapes? > > No, because then you have no way to clamp back memory, which is now > almost all used (we hold off from making LACKMEM() continually true, > if at all possible, which is almost always the case). You can't really > continually shrink memtuples to make space for new tapes, which is > what it would take.

I still don't get it. When building the initial runs, we don't need buffer space for maxTapes yet, because we're only writing to a single tape at a time. An unused tape shouldn't take much memory. In inittapes(), when we have built all the runs, we know how many tapes we actually needed, and we can allocate the buffer memory accordingly.

[thinks a bit, looks at logtape.c]. Hmm, I guess that's wrong, because of the way this all is implemented. When we're building the initial runs, we're only writing to one tape at a time, but logtape.c nevertheless holds onto a BLCKSZ'd currentBuffer, plus one buffer for each indirect level, for every tape that has been used so far. What if we changed LogicalTapeRewind to free those buffers? Flush out the indirect buffers to disk, remembering just the physical block number of the topmost indirect block in memory, and free currentBuffer. That way, a tape that has been used, but isn't being read or written to at the moment, would take very little memory, and we wouldn't need to reserve space for them in the build-runs phase. - Heikki
On Tue, Sep 6, 2016 at 10:51 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote: > I still don't get it. When building the initial runs, we don't need buffer > space for maxTapes yet, because we're only writing to a single tape at a > time. An unused tape shouldn't take much memory. In inittapes(), when we > have built all the runs, we know how many tapes we actually needed, and we > can allocate the buffer memory accordingly.

Right. That's correct. But, we're not concerned about physically allocated memory, but rather logically allocated memory (i.e., what goes into USEMEM()). tuplesort.c should be able to fully use the workMem specified by caller in the event of an external sort, just as with an internal sort.

> [thinks a bit, looks at logtape.c]. Hmm, I guess that's wrong, because of > the way this all is implemented. When we're building the initial runs, we're > only writing to one tape at a time, but logtape.c nevertheless holds onto a > BLCKSZ'd currentBuffer, plus one buffer for each indirect level, for every > tape that has been used so far. What if we changed LogicalTapeRewind to free > those buffers?

There isn't much point in that, because those buffers are never physically allocated in the first place when there are thousands. They are, however, entered into the tuplesort.c accounting as if they were, denying tuplesort.c the full benefit of available workMem. It doesn't matter if you USEMEM() or FREEMEM() after we first spill to disk, but before we begin the merge. (We already refund the unused-but-logically-allocated memory from unused tapes at the beginning of the merge (within beginmerge()), so we can't do any better than we already are from that point on -- that makes the batch memtuples growth thing slightly more effective.) -- Peter Geoghegan
On Tue, Sep 6, 2016 at 10:57 PM, Peter Geoghegan <pg@heroku.com> wrote: > There isn't much point in that, because those buffers are never > physically allocated in the first place when there are thousands. They > are, however, entered into the tuplesort.c accounting as if they were, > denying tuplesort.c the full benefit of available workMem. It doesn't > matter if you USEMEM() or FREEMEM() after we first spill to disk, but > before we begin the merge. (We already refund the > unused-but-logically-allocated memory from unused tapes at the beginning of > the merge (within beginmerge()), so we can't do any better than we > already are from that point on -- that makes the batch memtuples > growth thing slightly more effective.)

The big picture here is that you can't only USEMEM() for tapes as the need arises for new tapes as new runs are created. You'll just run a massive availMem deficit, that you have no way of paying back, because you can't "liquidate assets to pay off your creditors" (e.g., release a bit of the memtuples memory). The fact is that memtuples growth doesn't work that way. The memtuples array never shrinks. -- Peter Geoghegan
On 09/07/2016 09:01 AM, Peter Geoghegan wrote: > On Tue, Sep 6, 2016 at 10:57 PM, Peter Geoghegan <pg@heroku.com> wrote: >> There isn't much point in that, because those buffers are never >> physically allocated in the first place when there are thousands. They >> are, however, entered into the tuplesort.c accounting as if they were, >> denying tuplesort.c the full benefit of available workMem. It doesn't >> matter if you USEMEM() or FREEMEM() after we first spill to disk, but >> before we begin the merge. (We already refund the >> unused-but-logically-allocated memory from unused tapes at the beginning of >> the merge (within beginmerge()), so we can't do any better than we >> already are from that point on -- that makes the batch memtuples >> growth thing slightly more effective.) > > The big picture here is that you can't only USEMEM() for tapes as the > need arises for new tapes as new runs are created. You'll just run a > massive availMem deficit, that you have no way of paying back, because > you can't "liquidate assets to pay off your creditors" (e.g., release > a bit of the memtuples memory). The fact is that memtuples growth > doesn't work that way. The memtuples array never shrinks.

Hmm. But memtuples is empty, just after we have built the initial runs. Why couldn't we shrink, i.e. free and reallocate, it? - Heikki
On Tue, Sep 6, 2016 at 11:09 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote: >> The big picture here is that you can't only USEMEM() for tapes as the >> need arises for new tapes as new runs are created. You'll just run a >> massive availMem deficit, that you have no way of paying back, because >> you can't "liquidate assets to pay off your creditors" (e.g., release >> a bit of the memtuples memory). The fact is that memtuples growth >> doesn't work that way. The memtuples array never shrinks. > > > Hmm. But memtuples is empty, just after we have built the initial runs. Why > couldn't we shrink, i.e. free and reallocate, it? After we've built the initial runs, we do in fact give a FREEMEM() refund to those tapes that were not used within beginmerge(), as I mentioned just now (with a high workMem, this is often the great majority of many thousands of logical tapes -- that's how you get to wasting 8% of 5GB of maintenance_work_mem). What's at issue with this 500 tapes cap patch is what happens after tuples are first dumped (after we decide that this is going to be an external sort -- where we call tuplesort_merge_order() to get the number of logical tapes in the tapeset), but before the final merge happens, where we're already doing the right thing for merging by giving that refund. I want to stop logical allocation (USEMEM()) of an enormous number of tapes, to make run generation itself able to use more memory. It's surprisingly difficult to do something cleverer than just impose a cap. -- Peter Geoghegan
On 09/07/2016 09:17 AM, Peter Geoghegan wrote: > On Tue, Sep 6, 2016 at 11:09 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote: >>> The big picture here is that you can't only USEMEM() for tapes as the >>> need arises for new tapes as new runs are created. You'll just run a >>> massive availMem deficit, that you have no way of paying back, because >>> you can't "liquidate assets to pay off your creditors" (e.g., release >>> a bit of the memtuples memory). The fact is that memtuples growth >>> doesn't work that way. The memtuples array never shrinks. >> >> >> Hmm. But memtuples is empty, just after we have built the initial runs. Why >> couldn't we shrink, i.e. free and reallocate, it? > > After we've built the initial runs, we do in fact give a FREEMEM() > refund to those tapes that were not used within beginmerge(), as I > mentioned just now (with a high workMem, this is often the great > majority of many thousands of logical tapes -- that's how you get to > wasting 8% of 5GB of maintenance_work_mem).

Peter and I chatted over IM about this. Let me try to summarize the problems, and my plan:

1. When we start to build the initial runs, we currently reserve memory for tape buffers, maxTapes * TAPE_BUFFER_OVERHEAD. But we only actually need the buffers for tapes that are really used. We "refund" the buffers for the unused tapes after we've built the initial runs, but we're still wasting that while building the initial runs. We didn't actually allocate it, but we could've used it for other things. Peter's solution to this was to put a cap on maxTapes.

2. My observation is that during the build-runs phase, you only actually need those tape buffers for the one tape you're currently writing to. When you switch to a different tape, you could flush and free the buffers for the old tape. So reserving maxTapes * TAPE_BUFFER_OVERHEAD is excessive, 1 * TAPE_BUFFER_OVERHEAD would be enough. logtape.c doesn't have an interface for doing that today, but it wouldn't be hard to add.

3. If we do that, we'll still have to reserve the tape buffers for all the tapes that we use during merge. So after we've built the initial runs, we'll need to reserve memory for those buffers. That might require shrinking memtuples. But that's OK: after building the initial runs, memtuples is empty, so we can shrink it.

- Heikki
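Point 2 implies a small new logtape.c entry point, along these lines. This is purely a hypothetical sketch to make the plan concrete: no such function exists in the tree at this point, and the name and signature are invented.

/*
 * Hypothetical addition to logtape.c for point 2 above: when run
 * building switches away from a tape, flush its dirty buffers to disk,
 * remember just the block number of the topmost indirect block, and
 * free the in-memory buffers.  An inactive tape would then cost
 * (almost) nothing while some other tape is being written.
 */
extern void LogicalTapePause(LogicalTapeSet *lts, int tapenum);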
On Tue, Sep 6, 2016 at 8:28 PM, Peter Geoghegan <pg@heroku.com> wrote: >> In the meanwhile, I'll go and do some perf testing. >> >> Assuming the speedup is realized during testing, LGTM. > > Thanks. I suggest spending at least as much time on unsympathetic > cases (e.g., only 2 or 3 tapes must be merged). At the same time, I > suggest focusing on a type that has relatively expensive comparisons, > such as collated text, to make differences clearer.

The tests are still running (the benchmark script I came up with runs for a lot longer than I anticipated, about 2 days), but preliminary results are very promising; I can see a clear and consistent speedup. We'll have to wait for the complete results to see if there's any significant regression, though. I'll post the full results when I have them, but so far it all looks like this:

setup:

create table lotsofitext(i text, j text, w text, z integer, z2 bigint);
insert into lotsofitext select cast(random() * 1000000000.0 as text) || 'blablablawiiiiblabla', cast(random() * 1000000000.0 as text) || 'blablablawjjjblabla', cast(random() * 1000000000.0 as text) || 'blablabl awwwabla', random() * 1000000000.0, random() * 1000000000000.0 from generate_series(1, 10000000);

timed:

select count(*) FROM (select * from lotsofitext order by i, j, w, z, z2) t;

Unpatched Time: 100351.251 ms
Patched Time: 75180.787 ms

That's like a 25% speedup on random input. As we say over here, rather badly translated, not a turkey's boogers (meaning "nice!")

On Tue, Sep 6, 2016 at 9:50 PM, Claudio Freire <klaussfreire@gmail.com> wrote: > On Tue, Sep 6, 2016 at 9:19 PM, Peter Geoghegan <pg@heroku.com> wrote: >> On Tue, Sep 6, 2016 at 4:55 PM, Claudio Freire <klaussfreire@gmail.com> wrote: >>> I noticed, but here n = state->memtupcount >>> >>> + Assert(memtuples[0].tupindex == newtup->tupindex); >>> + >>> + CHECK_FOR_INTERRUPTS(); >>> + >>> + n = state->memtupcount; /* n is heap's size, >>> including old root */ >>> + imin = 0; /* >>> start with caller's "hole" in root */ >>> + i = imin; >> >> I'm fine with using "n" in the later assertion you mentioned, if >> that's clearer to you. memtupcount is broken out as "n" simply because >> that's less verbose, in a place where that makes things far clearer. >> >>> In fact, the assert on the patch would allow writing memtuples outside >>> the heap, as in calling tuplesort_heap_root_displace if >>> memtupcount==0, but I don't think that should be legal (memtuples[0] >>> == memtuples[imin] would be outside the heap). >> >> You have to have a valid heap (i.e. there must be at least one >> element) to call tuplesort_heap_root_displace(), and it doesn't >> directly compact the heap, so it must remain valid on return. The >> assertion exists to make sure that everything is okay with a >> one-element heap, a case which is quite possible. > > More than using "n" or "memtupcount" what I'm saying is to assert that > memtuples[imin] is inside the heap, which would catch the same errors > the original assert would, and more. > > Assert(imin < state->memtupcount) > > If you prefer. > > The original assert allows any value of imin for memtupcount>1, and > that's my main concern. It shouldn't.
So, for the assertions to properly avoid clobbering/reading out of bounds memory, you need both the above assert:

+ */
+ memtuples[i] = memtuples[imin];
+ i = imin;
+ }
+
> + Assert(imin < state->memtupcount);
+ memtuples[imin] = *newtup;
+}

And another one at the beginning, asserting:

+ SortTuple *memtuples = state->memtuples;
+ int n,
+ imin,
+ i;
+
> + Assert(state->memtupcount > 0 && memtuples[0].tupindex == newtup->tupindex);
+
+ CHECK_FOR_INTERRUPTS();

It's worth making that change, IMHO, unless I'm missing something.
On Thu, Sep 8, 2016 at 8:53 AM, Claudio Freire <klaussfreire@gmail.com> wrote: > setup: > > create table lotsofitext(i text, j text, w text, z integer, z2 bigint); > insert into lotsofitext select cast(random() * 1000000000.0 as text) > || 'blablablawiiiiblabla', cast(random() * 1000000000.0 as text) || > 'blablablawjjjblabla', cast(random() * 1000000000.0 as text) || > 'blablabl > awwwabla', random() * 1000000000.0, random() * 1000000000000.0 from > generate_series(1, 10000000); > > timed: > > select count(*) FROM (select * from lotsofitext order by i, j, w, z, z2) t; > > Unpatched Time: 100351.251 ms > Patched Time: 75180.787 ms > > That's like a 25% speedup on random input. As we say over here, rather > badly translated, not a turkey's boogers (meaning "nice!") Cool! What work_mem setting were you using here? >> More than using "n" or "memtupcount" what I'm saying is to assert that >> memtuples[imin] is inside the heap, which would catch the same errors >> the original assert would, and more. >> >> Assert(imin < state->memtupcount) >> >> If you prefer. >> >> The original asserts allows any value of imin for memtupcount>1, and >> that's my main concern. It shouldn't. > > So, for the assertions to properly avoid clobbering/reading out of > bounds memory, you need both the above assert: > > + */ > + memtuples[i] = memtuples[imin]; > + i = imin; > + } > + >>+ Assert(imin < state->memtupcount); > + memtuples[imin] = *newtup; > +} > > And another one at the beginning, asserting: > > + SortTuple *memtuples = state->memtuples; > + int n, > + imin, > + i; > + >>+ Assert(state->memtupcount > 0 && memtuples[0].tupindex == newtup->tupindex); > + > + CHECK_FOR_INTERRUPTS(); > > It's worth making that change, IMHO, unless I'm missing something. You're supposed to just not call it with an empty heap, so the assertions trust that much. I'll look into that. Currently, producing a new revision of this entire patchset. Improving the cost model (used when the parallel_workers storage parameter is not specified within CREATE INDEX) is taking a bit of time, but hope to have it out in the next couple of days. -- Peter Geoghegan
On Thu, Sep 8, 2016 at 2:13 PM, Peter Geoghegan <pg@heroku.com> wrote: > On Thu, Sep 8, 2016 at 8:53 AM, Claudio Freire <klaussfreire@gmail.com> wrote: >> setup: >> >> create table lotsofitext(i text, j text, w text, z integer, z2 bigint); >> insert into lotsofitext select cast(random() * 1000000000.0 as text) >> || 'blablablawiiiiblabla', cast(random() * 1000000000.0 as text) || >> 'blablablawjjjblabla', cast(random() * 1000000000.0 as text) || >> 'blablabl >> awwwabla', random() * 1000000000.0, random() * 1000000000000.0 from >> generate_series(1, 10000000); >> >> timed: >> >> select count(*) FROM (select * from lotsofitext order by i, j, w, z, z2) t; >> >> Unpatched Time: 100351.251 ms >> Patched Time: 75180.787 ms >> >> That's like a 25% speedup on random input. As we say over here, rather >> badly translated, not a turkey's boogers (meaning "nice!") > > Cool! What work_mem setting were you using here? The script iterates over a few variations of string patterns (easy comparisons vs hard comparisons), work mem (4MB, 64MB, 256MB, 1GB, 4GB), and table sizes (~350M, ~650M, ~1.5G). That particular case I believe is using work_mem=4MB, easy strings, 1.5GB table.
On Thu, Sep 8, 2016 at 10:18 AM, Claudio Freire <klaussfreire@gmail.com> wrote: > That particular case I believe is using work_mem=4MB, easy strings, 1.5GB table.

Cool. I wonder where this leaves Heikki's draft patch, which completely removes batch memory, etc. -- Peter Geoghegan
On Wed, Sep 7, 2016 at 2:36 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote: > 3. If we do that, we'll still have to reserve the tape buffers for all the > tapes that we use during merge. So after we've built the initial runs, we'll > need to reserve memory for those buffers. That might require shrinking > memtuples. But that's OK: after building the initial runs, memtuples is > empty, so we can shrink it.

Do you really think all this is worth the effort? Given how things are going to improve for merging anyway, I tend to doubt it. I'd rather just apply the cap (not necessarily 501 tapes, but something), and be done with it. As you know, Knuth never advocated more than 7 tapes at once, which I don't think had anything to do with the economics of tape drives in the 1970s (or problems with tape operators getting repetitive strain injuries). There is a chart in volume 3 about this. Senior hackers talked about a cap like this from day one, back in 2006, when Simon and Tom initially worked on scaling the number of tapes.

Alternatively, we could make MERGE_BUFFER_SIZE much larger, which I think would be a good idea independent of whatever waste the logical allocation of never-used tapes presents us with. It's currently 1/4 of 1MiB, which is hardly anything these days, and doesn't seem to have much to do with OS read ahead trigger sizes. If we were going to do something like you describe here, I'd prefer it to be driven by an observable benefit in performance, rather than a theoretical benefit. Not doing everything in one pass isn't necessarily worse than having a less cache efficient heap -- it might be quite a bit better, in fact. You've seen how hard it can be to get a sort that is I/O bound. (Sorting will tend to not be completely I/O bound, unless perhaps parallelism is used).

Anyway, this patch (patch 0001-*) is by far the least important of the 3 that you and Claudio are signed up to review. I don't think it's worth bending over backwards to do better. If you're not comfortable with a simple cap like this, then I'd suggest that we leave it at that, since our time is better spent elsewhere. We can just shelve it for now -- "returned with feedback". I wouldn't make any noise about it (although, I actually don't think that the cap idea is at all controversial). -- Peter Geoghegan
On Thu, Sep 8, 2016 at 2:18 PM, Claudio Freire <klaussfreire@gmail.com> wrote: > On Thu, Sep 8, 2016 at 2:13 PM, Peter Geoghegan <pg@heroku.com> wrote: >> On Thu, Sep 8, 2016 at 8:53 AM, Claudio Freire <klaussfreire@gmail.com> wrote: >>> setup: >>> >>> create table lotsofitext(i text, j text, w text, z integer, z2 bigint); >>> insert into lotsofitext select cast(random() * 1000000000.0 as text) >>> || 'blablablawiiiiblabla', cast(random() * 1000000000.0 as text) || >>> 'blablablawjjjblabla', cast(random() * 1000000000.0 as text) || >>> 'blablabl >>> awwwabla', random() * 1000000000.0, random() * 1000000000000.0 from >>> generate_series(1, 10000000); >>> >>> timed: >>> >>> select count(*) FROM (select * from lotsofitext order by i, j, w, z, z2) t; >>> >>> Unpatched Time: 100351.251 ms >>> Patched Time: 75180.787 ms >>> >>> That's like a 25% speedup on random input. As we say over here, rather >>> badly translated, not a turkey's boogers (meaning "nice!") >> >> Cool! What work_mem setting were you using here? > > The script iterates over a few variations of string patterns (easy > comparisons vs hard comparisons), work mem (4MB, 64MB, 256MB, 1GB, > 4GB), and table sizes (~350M, ~650M, ~1.5G). > > That particular case I believe is using work_mem=4MB, easy strings, 1.5GB table.

Well, the worst regression I see is under the noise for this test (which seems rather high at 5%, but that's to be expected since it's mostly big queries). Most samples show an improvement, either marginal or significant. The most improvement is, naturally, on low work_mem settings. I don't see significant slowdown on work_mem settings that should result in just a few tapes being merged, but I didn't instrument to check how many tapes were being merged in any case. Attached are the results both in ods, csv and raw formats. I think these are good results.

So, to summarize the review:

- Patch seems to follow the coding conventions of surrounding code
- Applies cleanly on top of 25794e841e5b86a0f90fac7f7f851e5d950e51e2, plus patches 1 and 2
- Builds without warnings
- Passes regression tests
- IMO has sufficient coverage from existing tests (none added)
- Does not introduce any significant performance regression
- Best improvement of 67% (reduction of runtime to 59%)
- Average improvement of 30% (reduction of runtime to 77%)
- Worst regression of 5% (increase of runtime to 105%), which is under the noise for control queries, so not significant
- Performance improvement is highly desirable in this merge step, as it's a big bottleneck in parallel sort (and, it seems, regular sort as well)
- All testing was done on random input; presorted input *will* show more pronounced improvements

I suggested changing a few asserts in tuplesort_heap_root_displace to make the debug code stricter in checking the assumptions, but they're not blockers:

+ Assert(state->memtupcount > 1 || imin == 0);
+ memtuples[imin] = *newtup;

into

+ Assert(imin < state->memtupcount);
+ memtuples[imin] = *newtup;

And, perhaps as well,

+ Assert(memtuples[0].tupindex == newtup->tupindex);
+
+ CHECK_FOR_INTERRUPTS();

into

+ Assert(state->memtupcount > 0 && memtuples[0].tupindex == newtup->tupindex);
+
+ CHECK_FOR_INTERRUPTS();

It was suggested that both tuplesort_heap_siftup and tuplesort_heap_root_displace could be wrappers around a common "siftup" implementation, since the underlying operation is very similar.
Since it is true that doing so would make it impossible to keep the asserts about tupindex in tuplesort_heap_root_displace, I guess it depends on how useful those asserts are (ie: how likely it is that those conditions could be violated, and how damaging it could be if they were). If it is decided the refactor is desirable, I'd suggest making the common siftup procedure static inline, to allow tuplesort_heap_root_displace to inline and specialize it, since it will be called with checkIndex=False and that simplifies the resulting code considerably.

Peter also mentioned that there were some other changes going on in the surrounding code that could impact this patch, so I'm marking the patch Waiting on Author.

Overall, however, I believe the patch is in good shape. Only minor form issues need to be changed; the functionality seems both desirable and ready.
On Fri, Sep 9, 2016 at 9:22 PM, Claudio Freire <klaussfreire@gmail.com> wrote: > Since it is true that doing so would make it impossible to keep the > asserts about tupindex in tuplesort_heap_root_displace, I guess it > depends on how useful those asserts are (ie: how likely it is that > those conditions could be violated, and how damaging it could be if > they were). If it is decided the refactor is desirable, I'd suggest > making the common siftup procedure static inline, to allow > tuplesort_heap_root_displace to inline and specialize it, since it > will be called with checkIndex=False and that simplifies the resulting > code considerably. > > Peter also mentioned that there were some other changes going on in > the surrounding code that could impact this patch, so I'm marking the > patch Waiting on Author. > > Overall, however, I believe the patch is in good shape. Only minor > form issues need to be changed; the functionality seems both desirable > and ready.

Sorry, forgot to specify, that was all about patch 3, the one about tuplesort_heap_root_displace.
On Fri, Sep 9, 2016 at 5:22 PM, Claudio Freire <klaussfreire@gmail.com> wrote: > Since it is true that doing so would make it impossible to keep the > asserts about tupindex in tuplesort_heap_root_displace, I guess it > depends on how useful those asserts are (ie: how likely it is that > those conditions could be violated, and how damaging it could be if > they were). If it is decided the refactor is desirable, I'd suggest > making the common siftup procedure static inline, to allow > tuplesort_heap_root_displace to inline and specialize it, since it > will be called with checkIndex=False and that simplifies the resulting > code considerably.

Right. I want to keep it as a separate function for all these reasons. I also think that I'll end up further optimizing what I've called tuplesort_heap_root_displace in the future, to adapt to clustered input. I'm thinking of something like Timsort's "galloping mode". What I've come up with here still needs 2 comparisons and a swap per call for presorted input. There is still a missed opportunity for clustered or (inverse) correlated input -- we can make merging opportunistically skip ahead to determine that the root tape's 100th tuple (say) would still fit in the root position of the merge minheap. So, immediately return 100 tuples from the root's tape without bothering to compare them to anything. Do a binary search to find the best candidate minheap root before the 100th tuple if a guess of 100 doesn't work out. Adapt to trends. Stuff like that. -- Peter Geoghegan
On 09/10/2016 03:22 AM, Claudio Freire wrote: > Overall, however, I believe the patch is in good shape. Only minor > form issues need to be changed; the functionality seems both desirable > and ready.

Pushed this "displace root" patch, with some changes:

* I renamed "tuplesort_heap_siftup()" to "tuplesort_delete_top()". I realize that this is controversial, per the discussion on the "Is tuplesort_heap_siftup() a misnomer?" thread. However, now that we have a new function, "tuplesort_heap_replace_top()", which is exactly the same algorithm as the "delete_top()" algorithm, calling one of them "siftup" became just too confusing. If anything, the new "replace_top" corresponds more closely to Knuth's siftup algorithm; delete-top is a special case of it. I added a comment on that to replace_top. I hope everyone can live with this.

* Instead of "root_displace", I used the name "replace_top", and "delete_top" for the old siftup function. Because we use "top" to refer to memtuples[0] more commonly than "root", in the existing comments.

* I shared the code between the delete-top and replace-top. Delete-top now calls the replace-top function, with the last element of the heap. Both functions have the same signature, i.e. they both take the checkIndex argument. Peter's patch left that out for the "replace" function, on performance grounds, but if that's worthwhile, that seems like a separate optimization. Might be worth benchmarking that separately, but I didn't want to conflate that with this patch.

* I replaced a few more siftup+insert calls with the new combined replace-top operation. Because why not.

Thanks for the patch, Peter, and thanks for the review, Claudio!

- Heikki
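The shared-code arrangement Heikki describes presumably reduces to something like the following (a paraphrase of the committed approach, not the verbatim committed code):

/*
 * Sketch: delete-top is just replace-top with the heap's former last
 * element, per the third bullet point above.
 */
static void
tuplesort_heap_delete_top(Tuplesortstate *state, bool checkIndex)
{
    SortTuple  *memtuples = state->memtuples;

    if (--state->memtupcount <= 0)
        return;                 /* heap is now empty */

    /* sift the former last element down from the root */
    tuplesort_heap_replace_top(state, &memtuples[state->memtupcount],
                               checkIndex);
}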
On Sun, Sep 11, 2016 at 6:28 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote: > * I renamed "tuplesort_heap_siftup()" to "tuplesort_delete_top()". I realize > that this is controversial, per the discussion on the "Is > tuplesort_heap_siftup() a misnomer?" thread. However, now that we have a new > function, "tuplesort_heap_replace_top()", which is exactly the same > algorithm as the "delete_top()" algorithm, calling one of them "siftup" > became just too confusing. I feel pretty strongly that this was the correct decision. I would have gone further, and removed any mention of "Sift up", but you can't win them all. > * Instead of "root_displace", I used the name "replace_top", and > "delete_top" for the old siftup function. Because we use "top" to refer to > memtuples[0] more commonly than "root", in the existing comments. Fine by me. > * I shared the code between the delete-top and replace-top. Delete-top now > calls the replace-top function, with the last element of the heap. Both > functions have the same signature, i.e. they both take the checkIndex > argument. Peter's patch left that out for the "replace" function, on > performance grounds, but if that's worthwhile, that seems like a separate > optimization. Might be worth benchmarking that separately, but I didn't want > to conflate that with this patch. Okay. > * I replaced a few more siftup+insert calls with the new combined > replace-top operation. Because why not. I suppose that the consistency has value, from a code clarity standpoint. > Thanks for the patch, Peter, and thanks for the review, Claudio! Thanks Heikki! -- Peter Geoghegan
On Sun, Sep 11, 2016 at 6:28 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote: > Pushed this "displace root" patch, with some changes:

Attached is a rebased version of the entire patch series, which should be applied on top of what you pushed to the master branch today. This features a new scheme for managing workMem -- maintenance_work_mem is now treated as a high watermark/budget for the entire CREATE INDEX operation, regardless of the number of workers. This seems to work much better, so Robert was right to suggest it. There were also improvements to the cost model, to weigh available maintenance_work_mem under this new system. And, the cost model was moved inside planner.c (next to plan_cluster_use_sort()), which is really where it belongs. The cost model is still WIP, though, and I didn't address some concerns of my own about how tuplesort.c coordinates workers. I think that Robert's "condition variables" will end up superseding that stuff anyway. And, I think that this v2 will bitrot fairly soon, when Heikki commits what is in effect his version of my 0002-* patch (that's unchanged, if only because it refactors some things that the parallel CREATE INDEX patch is reliant on).

So, while there are still a few loose ends with this revision (it should still certainly be considered WIP), I wanted to get a revision out quickly because V1 has been left to bitrot for too long now, and my schedule is very full for the next week, ahead of my leaving to go on vacation (which is long overdue). Hopefully, I'll be able to get out a third revision next Saturday, on top of the by-then-presumably-committed new tape batch memory patch from Heikki, just before I leave. I'd rather leave with a patch available that can be cleanly applied, to make review as easy as possible, since it wouldn't be great to have this V2 with bitrot for 10 days or more. -- Peter Geoghegan
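Concretely, the high-watermark scheme presumably amounts to something like the following split (an illustration based on the description above, not the patch's exact formula):

/*
 * Illustration of the new budgeting: the CREATE INDEX as a whole gets
 * maintenance_work_mem, and each participant's tuplesort gets a slice
 * of it, e.g.
 *
 *   maintenance_work_mem = 8GB, 8 participants (7 workers + leader)
 *     => roughly 8GB / 8 = 1GB of workMem per participant's sort
 *
 * as opposed to the v1 behavior, where every worker was budgeted the
 * full maintenance_work_mem.
 */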
On Sun, Sep 11, 2016 at 2:05 PM, Peter Geoghegan <pg@heroku.com> wrote: > On Sun, Sep 11, 2016 at 6:28 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote: >> Pushed this "displace root" patch, with some changes: > > Attached is rebased version of the entire patch series, which should > be applied on top of what you pushed to the master branch today. 0003 looks like a sensible cleanup of our #include structure regardless of anything this patch series is trying to accomplish, so I've committed it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 08/02/2016 01:18 AM, Peter Geoghegan wrote: > Tape unification > ---------------- > > Sort operations have a unique identifier, generated before any workers > are launched, using a scheme based on the leader's PID, and a unique > temp file number. This makes all on-disk state (temp files managed by > logtape.c) discoverable by the leader process. State in shared memory > is sized in proportion to the number of workers, so the only thing > about the data being sorted that gets passed around in shared memory > is a little logtape.c metadata for tapes, describing for example how > large each constituent BufFile is (a BufFile associated with one > particular worker's tapeset). > > (See below also for notes on buffile.c's role in all of this, fd.c and > resource management, etc.) > > ... > > buffile.c, and "unification" > ============================ > > There has been significant new infrastructure added to make logtape.c > aware of workers. buffile.c has in turn been taught about unification > as a first class part of the abstraction, with low-level management of > certain details occurring within fd.c. So, "tape unification" within > processes to open other backend's logical tapes to generate a unified > logical tapeset for the leader to merge is added. This is probably the > single biggest source of complexity for the patch, since I must > consider: > > * Creating a general, reusable abstraction for other possible BufFile > users (logtape.c only has to serve tuplesort.c, though). > > * Logical tape free space management. > > * Resource management, file lifetime, etc. fd.c resource management > can now close a file at xact end for temp files, while not deleting it > in the leader backend (only the "owning" worker backend deletes the > temp file it owns). > > * Crash safety (e.g., when to truncate existing temp files, and when not to).

I find this unification business really complicated. I think it'd be simpler to keep the BufFiles and LogicalTapeSets separate, and instead teach tuplesort.c how to merge tapes that live on different LogicalTapeSets/BufFiles. Or refactor LogicalTapeSet so that a single LogicalTapeSet can contain tapes from different underlying BufFiles.

What I have in mind is something like the attached patch. It refactors the LogicalTapeRead(), LogicalTapeWrite() etc. functions to take a LogicalTape as argument, instead of LogicalTapeSet and tape number. LogicalTapeSet doesn't have the concept of a tape number anymore, it can contain any number of tapes, and you can create more on the fly. With that, it'd be fairly easy to make tuplesort.c merge LogicalTapes that came from different tape sets, backed by different BufFiles. I think that'd avoid much of the unification code.

That leaves one problem, though: reusing space in the final merge phase. If the tapes being merged belong to different LogicalTapeSets, and you create one new tape to hold the result, the new tape cannot easily reuse the space of the input tapes because they are on different tape sets. But looking at your patch, ISTM you actually dodged that problem as well:

> + * As a consequence of only being permitted to write to the leader > + * controlled range, parallel sorts that require a final materialized tape > + * will use approximately twice the disk space for temp files compared to > + * a more or less equivalent serial sort. This is deemed acceptable, > + * since it is far rarer in practice for parallel sort operations to > + * require a final materialized output tape.
Note that this does not > + * apply to any merge process required by workers, which may reuse space > + * eagerly, just like conventional serial external sorts, and so > + * typically, parallel sorts consume approximately the same amount of disk > + * blocks as a more or less equivalent serial sort, even when workers must > + * perform some merging to produce input to the leader.

I'm slightly worried about that. Maybe it's OK for a first version, but it'd be annoying in a query where a sort is below a merge join, for example, so that you can't do the final merge on the fly because mark/restore support is needed. One way to fix that would be to have all the parallel workers share the work files to begin with, and keep the "nFileBlocks" value in shared memory so that the workers won't overlap each other. Then all the blocks from different workers would be mixed together, though, which would hurt the sequential pattern of the tapes, so each worker would need to allocate larger chunks to avoid that. - Heikki
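The interface change Heikki describes would read roughly like this (a sketch inferred from the description rather than copied from the attached patch; names and signatures are approximate, and the attached patch is authoritative):

/*
 * Sketch of the refactored logtape.c interface: tapes become
 * first-class objects created on the fly, so a LogicalTapeSet no
 * longer hands out tape numbers, and -- the point of the exercise --
 * a merge could in principle consume LogicalTape pointers backed by
 * different BufFiles.
 */
typedef struct LogicalTape LogicalTape;         /* opaque */

extern LogicalTape *LogicalTapeCreate(LogicalTapeSet *lts);
extern void LogicalTapeWrite(LogicalTape *lt, void *ptr, size_t size);
extern void LogicalTapeRewindForRead(LogicalTape *lt);
extern size_t LogicalTapeRead(LogicalTape *lt, void *ptr, size_t size);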
On 08/02/2016 01:18 AM, Peter Geoghegan wrote: > No merging in parallel > ---------------------- > > Currently, merging worker *output* runs may only occur in the leader > process. In other words, we always keep n worker processes busy with > scanning-and-sorting (and maybe some merging), but then all processes > but the leader process grind to a halt (note that the leader process > can participate as a scan-and-sort tuplesort worker, just as it will > everywhere else, which is why I specified "parallel_workers = 7" but > talked about 8 workers). > > One leader process is kept busy with merging these n output runs on > the fly, so things will bottleneck on that, which you saw in the > example above. As already described, workers will sometimes merge in > parallel, but only their own runs -- never another worker's runs. I > did attempt to address the leader merge bottleneck by implementing > cross-worker run merging in workers. I got as far as implementing a > very rough version of this, but initial results were disappointing, > and so that was not pursued further than the experimentation stage. > > Parallel merging is a possible future improvement that could be added > to what I've come up with, but I don't think that it will move the > needle in a really noticeable way. It'd be good if you could overlap the final merges in the workers with the merge in the leader. ISTM it would be quite straightforward to replace the final tape of each worker with a shared memory queue, so that the leader could start merging and returning tuples as soon as it gets the first tuple from each worker. Instead of having to wait for all the workers to complete first. - Heikki
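Sketched in terms of the existing shm_mq infrastructure, the worker side of that idea might look as follows. This is an illustration only: next_worker_tuple() is a hypothetical stand-in for "next tuple from this worker's final merge", mqh is assumed to be a queue handle already set up via shm_mq_attach(), and a production version would batch tuples rather than send them one at a time.

/*
 * Worker: push each tuple of the final sorted run into a shared memory
 * queue, instead of materializing it on a final tape, so the leader
 * can begin merging as soon as every queue has produced one tuple.
 */
MinimalTuple tup;

while ((tup = next_worker_tuple(state)) != NULL)
{
    if (shm_mq_send(mqh, tup->t_len, tup, false) == SHM_MQ_DETACHED)
        break;                  /* leader detached; stop early */
}

/*
 * Leader: shm_mq_receive() from each worker's queue feeds the leader's
 * merge heap, replacing "read next preread tuple from that tape".
 */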
On Thu, Sep 22, 2016 at 3:51 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote: > It'd be good if you could overlap the final merges in the workers with the > merge in the leader. ISTM it would be quite straightforward to replace the > final tape of each worker with a shared memory queue, so that the leader > could start merging and returning tuples as soon as it gets the first tuple > from each worker. Instead of having to wait for all the workers to complete > first. If you do that, make sure to have the leader read multiple tuples at a time from each worker whenever possible. It makes a huge difference to performance. See bc7fcab5e36b9597857fa7e3fa6d9ba54aaea167. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Sep 21, 2016 at 5:52 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> I find this unification business really complicated.

I can certainly understand why you would. As I said, it's the most complicated part of the patch, which overall is one of the most ambitious patches I've ever written.

> I think it'd be simpler
> to keep the BufFiles and LogicalTapeSets separate, and instead teach
> tuplesort.c how to merge tapes that live on different
> LogicalTapeSets/BufFiles. Or refactor LogicalTapeSet so that a single
> LogicalTapeSet can contain tapes from different underlying BufFiles.
>
> What I have in mind is something like the attached patch. It refactors
> LogicalTapeRead(), LogicalTapeWrite() etc. functions to take a LogicalTape
> as argument, instead of LogicalTapeSet and tape number. LogicalTapeSet
> doesn't have the concept of a tape number anymore, it can contain any number
> of tapes, and you can create more on the fly. With that, it'd be fairly easy
> to make tuplesort.c merge LogicalTapes that came from different tape sets,
> backed by different BufFiles. I think that'd avoid much of the unification
> code.

I think that it won't be possible to make a LogicalTapeSet ever use more than one BufFile without regressing the ability to eagerly reuse space, which is almost the entire reason for logtape.c existing. The whole indirect block thing is an idea borrowed from the FS world, of course, and so logtape.c needs one block-device-like BufFile, with blocks that can be reclaimed eagerly, but consumed for recycling in *contiguous* order (which is why they're sorted using qsort() within ltsGetFreeBlock()). You're going to increase the amount of random I/O by using more than one BufFile for an entire tapeset, I think.

This patch you posted ("0001-Refactor-LogicalTapeSet-LogicalTape-interface.patch") just keeps one BufFile, and only changes the interface to expose the tapes themselves to tuplesort.c, without actually making tuplesort.c do anything with that capability. I see what you're getting at, I think, but I don't see how that accomplishes all that much for parallel CREATE INDEX. I mean, the special case of having multiple tapesets from workers (not one "unified" tapeset created from worker temp files from their tapesets to begin with) now needs special treatment. Haven't you just moved the complexity around (once your patch is made to care about parallelism)? Having multiple entire tapesets explicitly from workers, with their own BufFiles, is not clearly less complicated than managing ranges from BufFile fd.c files with delineated ranges of "logical tapeset space". Seems almost equivalent, except that my way doesn't bother tuplesort.c with any of this.

>> + * As a consequence of only being permitted to write to the leader
>> + * controlled range, parallel sorts that require a final materialized tape
>> + * will use approximately twice the disk space for temp files compared to
>> + * a more or less equivalent serial sort.
>
> I'm slightly worried about that. Maybe it's OK for a first version, but it'd
> be annoying in a query where a sort is below a merge join, for example, so
> that you can't do the final merge on the fly because mark/restore support is
> needed.

My intuition is that we'll *never* end up using this for merge joins. I think that I could do better here (why should workers really care at this point?), but just haven't bothered to.
This parallel sort implementation is something written with CREATE INDEX and CLUSTER in mind only (maybe one or two other things, too). I believe that for query execution, partitioning is the future [1]. With merge joins, partitioning is desirable because it lets you push down *everything* to workers, not just sorting (e.g., by aligning partitioning boundaries on each side of each merge join sort in the worker, and having the worker also "synchronize" each side of the join, all independently and without a dependency on a final merge).

That's why I think it's okay that I use twice as much space for randomAccess tuplesort.c callers. No real world caller will ever end up needing to do this. It just seems like a good idea to support randomAccess when using this new infrastructure, on general principle. Forcing myself to support that case during initial development actually resulted in much cleaner, less invasive changes to tuplesort.c in general.

[1] https://www.postgresql.org/message-id/flat/CAM3SWZR+ATYAzyMT+hm-Bo=1L1smtJbNDtibwBTKtYqS0dYZVg@mail.gmail.com#CAM3SWZR+ATYAzyMT+hm-Bo=1L1smtJbNDtibwBTKtYqS0dYZVg@mail.gmail.com

--
Peter Geoghegan
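For readers without logtape.c paged in: the eager space reuse being defended here boils down to a freelist of blocks within one BufFile, consumed lowest-block-first so that reads and writes stay mostly sequential. A condensed sketch of that idea, with an invented struct name, not the actual logtape.c code:

    typedef struct
    {
        long   *freeBlocks;     /* block numbers, kept sorted in decreasing
                                 * order by qsort(), so the smallest is last */
        int     nFreeBlocks;
        long    nFileBlocks;    /* # of blocks in the underlying BufFile */
    } LogicalTapeSetSketch;

    static long
    ltsGetFreeBlockSketch(LogicalTapeSetSketch *lts)
    {
        /*
         * Recycle the lowest-numbered free block first, preserving largely
         * contiguous, ascending access; otherwise extend the BufFile.
         */
        if (lts->nFreeBlocks > 0)
            return lts->freeBlocks[--lts->nFreeBlocks];
        return lts->nFileBlocks++;
    }

With several BufFiles per tape set, there would be several independent freelists and allocation frontiers, which is the random-I/O concern being raised above.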
On Thu, Sep 22, 2016 at 8:57 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Sep 22, 2016 at 3:51 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> It'd be good if you could overlap the final merges in the workers with the
>> merge in the leader. ISTM it would be quite straightforward to replace the
>> final tape of each worker with a shared memory queue, so that the leader
>> could start merging and returning tuples as soon as it gets the first tuple
>> from each worker. Instead of having to wait for all the workers to complete
>> first.
>
> If you do that, make sure to have the leader read multiple tuples at a
> time from each worker whenever possible. It makes a huge difference
> to performance. See bc7fcab5e36b9597857fa7e3fa6d9ba54aaea167.

That requires some kind of mutual exclusion mechanism, like an LWLock. It's possible that merging everything lazily is actually the faster approach, given this, and given the likely bottleneck on I/O at this stage. It's also certainly simpler to not overlap things. This is something I've read about before [1], with "eager evaluation" sorting not necessarily coming out ahead IIRC.

[1] http://digitalcommons.ohsu.edu/cgi/viewcontent.cgi?article=1193&context=csetech

--
Peter Geoghegan
On Sat, Sep 24, 2016 at 9:07 AM, Peter Geoghegan <pg@heroku.com> wrote:
> On Thu, Sep 22, 2016 at 8:57 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Thu, Sep 22, 2016 at 3:51 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>> It'd be good if you could overlap the final merges in the workers with the
>>> merge in the leader. ISTM it would be quite straightforward to replace the
>>> final tape of each worker with a shared memory queue, so that the leader
>>> could start merging and returning tuples as soon as it gets the first tuple
>>> from each worker. Instead of having to wait for all the workers to complete
>>> first.
>>
>> If you do that, make sure to have the leader read multiple tuples at a
>> time from each worker whenever possible. It makes a huge difference
>> to performance. See bc7fcab5e36b9597857fa7e3fa6d9ba54aaea167.
>
> That requires some kind of mutual exclusion mechanism, like an LWLock.

No, it doesn't. Shared memory queues are single-reader, single-writer.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Sep 26, 2016 at 6:58 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> That requires some kind of mutual exclusion mechanism, like an LWLock.
>
> No, it doesn't. Shared memory queues are single-reader, single-writer.

The point is that there is a natural dependency when merging is performed eagerly within the leader. One thing needs to be in lockstep with the others. That's all.

--
Peter Geoghegan
On Mon, Sep 26, 2016 at 3:40 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Mon, Sep 26, 2016 at 6:58 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> That requires some kind of mutual exclusion mechanism, like an LWLock.
>>
>> No, it doesn't. Shared memory queues are single-reader, single-writer.
>
> The point is that there is a natural dependency when merging is
> performed eagerly within the leader. One thing needs to be in lockstep
> with the others. That's all.

I don't know what any of that means. You said we need something like an LWLock, but I think we don't. The workers just write the results of their own final merges into shm_mqs. The leader can read from any given shm_mq until no more tuples can be read without blocking, just like nodeGather.c does, or at least it can do that unless its own queue fills up first. No mutual exclusion mechanism is required for any of that, as far as I can see - not an LWLock, and not anything similar.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
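The nodeGather.c pattern being referred to, condensed from gather_readnext() and using the real TupleQueueReaderNext() interface; the reader-array bookkeeping is elided:

    /*
     * Visit each worker's single-reader, single-writer queue in turn.  No
     * lock is needed, because only the leader ever reads these queues.
     */
    for (;;)
    {
        TupleQueueReader *reader = readers[nextreader];
        bool        readerdone = false;
        HeapTuple   tup;

        tup = TupleQueueReaderNext(reader, true, &readerdone);  /* nowait */
        if (readerdone)
            ;               /* discard the exhausted reader from the array */
        else if (tup != NULL)
            return tup;     /* or: feed the tuple into a merge heap */

        nextreader = (nextreader + 1) % nreaders;
        /* if every queue would block, WaitLatch() and retry */
    }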
On Sun, Sep 11, 2016 at 11:05 AM, Peter Geoghegan <pg@heroku.com> wrote:
> So, while there are still a few loose ends with this revision (it
> should still certainly be considered WIP), I wanted to get a revision
> out quickly because V1 has been left to bitrot for too long now, and
> my schedule is very full for the next week, ahead of my leaving to go
> on vacation (which is long overdue). Hopefully, I'll be able to get
> out a third revision next Saturday, on top of the
> by-then-presumably-committed new tape batch memory patch from Heikki,
> just before I leave. I'd rather leave with a patch available that can
> be cleanly applied, to make review as easy as possible, since it
> wouldn't be great to have this V2 with bitrot for 10 days or more.

Heikki committed his preload memory patch a little later than originally expected, 4 days ago. I attach V3 of my own parallel CREATE INDEX patch, which should be applied on top of today's git master (there is a bugfix that reviewers won't want to miss -- commit b56fb691). I have my own test suite, and have to some extent used TDD for this patch, so rebasing was not so bad. My tests are rather rough and ready, so I'm not going to post them here. (Changes in the WaitLatch() API also caused bitrot, which is now fixed.)

Changes from V2:

* Since Heikki eliminated the need for any extra memtuple "slots" (memtuples is now only exactly big enough for the initial merge heap), an awful lot of code could be thrown out that managed sizing memtuples in the context of the parallel leader (based on trends seen in parallel workers). I was able to follow Heikki's example by eliminating code for parallel sorting memtuples sizing. Throwing this code out let me streamline a lot of stuff within tuplesort.c, which is cleaned up quite a bit.

* Since this revision was mostly focused on fixing up logtape.c (rebasing on top of Heikki's work), I also took the time to clarify some things about how a block-based offset might need to be applied within the leader. Essentially, outlining how and where that happens, and where it doesn't and shouldn't happen. (An offset must sometimes be applied to compensate for differences in logical BufFile positioning (leader/worker differences) following the leader's unification of worker tapesets into one big tapeset of its own.)

* max_parallel_workers_maintenance now supersedes the use of the new parallel_workers index storage parameter. This matches existing heap storage parameter behavior, and allows the implementation to add practically no cycles as compared to the master branch when the use of parallelism is disabled by setting max_parallel_workers_maintenance to 0.

* New additions to the chapter in the documentation that Robert added a little while back, "Chapter 15. Parallel Query". It's perhaps a bit of a stretch to call this feature part of parallel query, but I think that it works reasonably well. The optimizer does determine the number of workers needed here, so while it doesn't formally produce a query plan, I think the implication that it does is acceptable for user-facing documentation. (Actually, it would be nice if you really could EXPLAIN utility commands -- that would be a handy place to show information about how they were executed.) Maybe this new documentation describes things in what some would consider to be excessive detail for users. The relatively detailed information added on parallel sorting seemed to be in the pragmatic spirit of the new chapter 15, so I thought I'd see what people thought.
Work is still needed on:

* Cost model. Should probably attempt to guess final index size, and derive calculation of number of workers from that. Also, I'm concerned that I haven't given enough thought to the low end, where with default settings most CREATE INDEX statements will use at least one parallel worker.

* The whole way that I teach nbtsort.c to disallow catalog tables for parallel CREATE INDEX due to concerns about parallel safety is in need of expert review, preferably from Robert. It's complicated in a way that relies on things happening or not happening from a distance.

* Heikki seems to want to change more about logtape.c, and its use of indirection blocks. That may evolve, but for now I can only target the master branch.

* More extensive performance testing. I think that this V3 is probably the fastest version yet, what with Heikki's improvements, but I haven't really verified that.

--
Peter Geoghegan
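On the cost model item above: one plausible starting point is to scale workers logarithmically with the projected index size, mirroring what create_plain_partial_paths() already does with heap size. A sketch, where the base threshold and the function name are assumptions rather than anything from the patch:

    static int
    index_build_workers_sketch(BlockNumber est_index_pages)
    {
        BlockNumber threshold = 1024;   /* assumed ~8MB base threshold */
        int         workers;

        /* too small to be worth launching even one worker */
        if (est_index_pages < threshold)
            return 0;

        /* one extra worker each time the projected size triples */
        for (workers = 1; est_index_pages >= threshold * 3; workers++)
            threshold *= 3;

        return Min(workers, max_parallel_workers_maintenance);
    }

A model like this also gives a natural answer at the low end: anything below the base threshold gets no workers at all.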
On Fri, Oct 7, 2016 at 5:47 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Work is still needed on:
>
> * Cost model. Should probably attempt to guess final index size, and
> derive calculation of number of workers from that. Also, I'm concerned
> that I haven't given enough thought to the low end, where with default
> settings most CREATE INDEX statements will use at least one parallel
> worker.
>
> * The whole way that I teach nbtsort.c to disallow catalog tables for
> parallel CREATE INDEX due to concerns about parallel safety is in need
> of expert review, preferably from Robert. It's complicated in a way
> that relies on things happening or not happening from a distance.
>
> * Heikki seems to want to change more about logtape.c, and its use of
> indirection blocks. That may evolve, but for now I can only target the
> master branch.
>
> * More extensive performance testing. I think that this V3 is probably
> the fastest version yet, what with Heikki's improvements, but I
> haven't really verified that.

I realize that you are primarily targeting utility commands here, and that is obviously great, because making index builds faster is very desirable. However, I'd just like to talk for a minute about how this relates to parallel query. With Rushabh's Gather Merge patch, you can now have a plan that looks like Gather Merge -> Sort -> whatever. That patch also allows other patterns that are useful completely independently of this patch, like Finalize GroupAggregate -> Gather Merge -> Partial GroupAggregate -> Sort -> whatever, but the Gather Merge -> Sort -> whatever path is very related to what this patch does. For example, instead of committing this patch at all, we could try to funnel index creation through the executor, building a plan of that shape, and using the results to populate the index. I'm not saying that's a good idea, but it could be done.

On the flip side, what if anything can queries hope to get out of parallel sort that they can't get out of Gather Merge? One possibility is that a parallel sort might end up being substantially faster than Gather Merge-over-non-parallel sort. In that case, we obviously should prefer it. Other possibilities seem a little obscure. For example, it's possible that you might want to have all workers participate in sorting some data and then divide the result of the sort into equal ranges that are again divided among the workers, or that you might want all workers to sort and then each worker to read a complete copy of the output data. But these don't seem like particularly mainstream needs, nor do they necessarily seem like problems that parallel sort itself should be trying to solve.

The Volcano paper[1], one of the oldest and most-cited sources I can find for research into parallel execution and with a design fairly similar to our own executor, describes various variants of what they call Exchange, of which what we now call Gather is one[2]. They describe another variant called Interchange, which acts like a Gather node without terminating parallelism: every worker process reads the complete output of an Interchange, which is the union of all rows produced by all workers running the Interchange's input plan. That seems like a better design than coupling such data flows specifically to parallel sort.

I'd like to think that parallel sort will help lots of queries, as well as helping utility commands, but I'm not sure it will. Thoughts?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

[1] "Volcano - an Extensible and Parallel Query Evaluation System", https://pdfs.semanticscholar.org/865b/5f228f08ebac0b68d3a4bfd97929ee85e4b6.pdf
[2] See "C. Variants of the Exchange Operator" on p. 13 of [1]
On Wed, Oct 12, 2016 at 11:09 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> I realize that you are primarily targeting utility commands here, and
> that is obviously great, because making index builds faster is very
> desirable. However, I'd just like to talk for a minute about how this
> relates to parallel query. With Rushabh's Gather Merge patch, you can
> now have a plan that looks like Gather Merge -> Sort -> whatever.
> That patch also allows other patterns that are useful completely
> independently of this patch, like Finalize GroupAggregate -> Gather
> Merge -> Partial GroupAggregate -> Sort -> whatever, but the Gather
> Merge -> Sort -> whatever path is very related to what this patch
> does. For example, instead of committing this patch at all, we could
> try to funnel index creation through the executor, building a plan of
> that shape, and using the results to populate the index. I'm not
> saying that's a good idea, but it could be done.

Right, but that would be essentially the same approach as mine, but, I suspect, less efficient and more complicated. More importantly, it wouldn't be partitioning, and partitioning is what we really need within the executor.

> On the flip side, what if anything can queries hope to get out of
> parallel sort that they can't get out of Gather Merge? One
> possibility is that a parallel sort might end up being substantially
> faster than Gather Merge-over-non-parallel sort. In that case, we
> obviously should prefer it.

I must admit that I don't know enough about it to comment just yet. Offhand, it occurs to me that the Gather Merge sorted input could come from a number of different types of paths/nodes, whereas adopting what I've done here could only work more or less equivalently to "Gather Merge -> Sort -> Seq Scan" -- a special case, really.

> For example, it's possible that you might want to have all
> workers participate in sorting some data and then divide the result of
> the sort into equal ranges that are again divided among the workers,
> or that you might want all workers to sort and then each worker to
> read a complete copy of the output data. But these don't seem like
> particularly mainstream needs, nor do they necessarily seem like
> problems that parallel sort itself should be trying to solve.

This project of mine is about parallelizing tuplesort.c, which isn't really what you want for parallel query -- you shouldn't try to scope the problem as "make the sort more scalable using parallelism" there. Rather, you want to scope it at "make the execution of the entire query more scalable using parallelism", which is really quite a different thing, which necessarily involves the executor having direct knowledge of partition boundaries. Maybe the executor enlists tuplesort.c to help with those boundaries to some degree, but that whole thing is basically something which treats tuplesort.c as a low level primitive.

> The
> Volcano paper[1], one of the oldest and most-cited sources I can find
> for research into parallel execution and with a design fairly similar
> to our own executor, describes various variants of what they call
> Exchange, of which what we now call Gather is one.

I greatly respect the work of Goetz Graefe, including his work on the Volcano paper. Graefe has been the single biggest external influence on my work on Postgres.
> They describe
> another variant called Interchange, which acts like a Gather node
> without terminating parallelism: every worker process reads the
> complete output of an Interchange, which is the union of all rows
> produced by all workers running the Interchange's input plan. That
> seems like a better design than coupling such data flows specifically
> to parallel sort.
>
> I'd like to think that parallel sort will help lots of queries, as
> well as helping utility commands, but I'm not sure it will. Thoughts?

You are right that I'm targeting the cases where we can get real benefits without really changing the tuplesort.h contract too much. This is literally the parallel tuplesort.c patch, which probably isn't very useful for parallel query, because the final output is always consumed serially here (this doesn't matter all that much for CREATE INDEX, I believe). This approach of mine seems like the simplest way of getting a large benefit to users involving parallelizing sorting, but I certainly don't imagine it to be the be all and end all.

I have at least tried to anticipate how tuplesort.c will eventually serve the needs of partitioning for the benefit of parallel query. My intuition is that you'll have to teach it about partitioning boundaries fairly directly -- it won't do to add something generic to the executor. And, it probably won't be the only thing that needs to be taught about them.

--
Peter Geoghegan
On Thu, Oct 13, 2016 at 12:35 AM, Peter Geoghegan <pg@heroku.com> wrote:
> On Wed, Oct 12, 2016 at 11:09 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
>> On the flip side, what if anything can queries hope to get out of
>> parallel sort that they can't get out of Gather Merge? One
>> possibility is that a parallel sort might end up being substantially
>> faster than Gather Merge-over-non-parallel sort. In that case, we
>> obviously should prefer it.
>
> I must admit that I don't know enough about it to comment just yet.
> Offhand, it occurs to me that the Gather Merge sorted input could come
> from a number of different types of paths/nodes, whereas adopting what
> I've done here could only work more or less equivalently to "Gather
> Merge -> Sort -> Seq Scan" -- a special case, really.
>
>> For example, it's possible that you might want to have all
>> workers participate in sorting some data and then divide the result of
>> the sort into equal ranges that are again divided among the workers,
>> or that you might want all workers to sort and then each worker to
>> read a complete copy of the output data. But these don't seem like
>> particularly mainstream needs, nor do they necessarily seem like
>> problems that parallel sort itself should be trying to solve.
>
> This project of mine is about parallelizing tuplesort.c, which isn't
> really what you want for parallel query -- you shouldn't try to scope
> the problem as "make the sort more scalable using parallelism" there.
> Rather, you want to scope it at "make the execution of the entire
> query more scalable using parallelism", which is really quite a
> different thing, which necessarily involves the executor having direct
> knowledge of partition boundaries.

Okay, but what is the proof, or why do you think the second is going to be better than the first? One thing which strikes me as a major difference between your approach and Gather Merge is that in your approach the leader has to wait till all the workers are done with their work on sorting, whereas with Gather Merge, as soon as the first one is done, the leader starts with merging. I could be wrong here, but if I understood it correctly, then there is an argument that a Gather Merge kind of approach can win in cases where some of the workers can produce sorted outputs ahead of others, and I am not sure if we can dismiss such cases.

+struct Sharedsort
+{
..
+ * Workers increment workersFinished to indicate having finished. If
+ * this is equal to state.launched within the leader, leader is ready
+ * to merge runs.
+ *
+ * leaderDone indicates if leader is completely done (i.e., was
+ * tuplesort_end called against the state through which parallel output
+ * was consumed?)
+ */
+ int currentWorker;
+ int workersFinished;
..
}

By looking at 'workersFinished' usage, it looks like you have devised a new way for the leader to know when workers have finished, which might be required for this patch. However, have you tried to use or investigate whether existing infrastructure which serves the same purpose could be used for it?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Mon, Oct 17, 2016 at 5:36 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Okay, but what is the proof, or why do you think the second is going to
> be better than the first?

I don't have proof. It's my opinion that it probably would be, based on partial information, and my intuition. It's hard to prove something like that, because it's not really clear what that alternative would look like. Also, finding all of that out would take a long time -- it's hard to prototype. Do tuple table slots need to care about IndexTuples now? What does that even look like? What existing executor code needs to be taught about this new requirement?

> One thing which strikes me as a major difference
> between your approach and Gather Merge is that in your approach the leader
> has to wait till all the workers are done with their work on sorting,
> whereas with Gather Merge, as soon as the first one is done, the leader
> starts with merging. I could be wrong here, but if I understood it
> correctly, then there is an argument that a Gather Merge kind of approach
> can win in cases where some of the workers can produce sorted outputs
> ahead of others, and I am not sure if we can dismiss such cases.

How can it? You need to have at least one tuple from every worker (before the worker has exhausted its supply of output tuples) in order to merge and return the next tuple to the top level consumer (the thing above the Gather Merge). If you're talking about "eager vs. lazy merging", please see my previous remarks on that, on this thread. (In any case, whether we merge more eagerly seems like an orthogonal question to the one you ask.)

The first thing to note about my approach is that I openly acknowledge that this parallel CREATE INDEX patch is not much use for parallel query. I have only generalized tuplesort.c to support parallelizing a sort operation. I think that parallel query needs partitioning to push down parts of a sort to workers, with little or no need for them to be funneled together at the end, since most tuples are eliminated before being passed to the Gather/Gather Merge node. The partitioning part is really hard.

I guess that Gather Merge nodes have value because they allow us to preserve the sorted-ness of a parallel path, which might be most useful because it enables things elsewhere. But, that doesn't really recommend making Gather Merge nodes good at batch processing a large number of tuples, I suspect. (One problem with the tuple queue mechanism is that it can be a big bottleneck -- that's why we want to eliminate most tuples before they're passed up to the leader, in the case of parallel sequential scan in 9.6.)

I read the following paragraph from the Volcano paper just now:

"""
During implementation and benchmarking of parallel sorting, we added two more features to exchange. First, we wanted to implement a merge network in which some processors produce sorted streams merge concurrently by other processors. Volcano's sort iterator can be used to generate a sorted stream. A merge iterator was easily derived from the sort module. It uses a single level merge, instead of the cascaded merge of runs used in sort. The input of a merge iterator is an exchange. Differently from other operators, the merge iterator requires to distinguish the input records by their producer. As an example, for a join operation it does not matter where the input records were created, and all inputs can be accumulated in a single input stream. For a merge operation, it is crucial to distinguish the input records by their producer in order to merge multiple sorted streams correctly.
"""

I don't really understand this paragraph, but thought I'd ask: why the need to "distinguish the input records by their producer in order to merge multiple sorted streams correctly"? Isn't that talking about partitioning, where each worker's *ownership* of a range matters? My patch doesn't care which values belong to which workers. And, it focuses quite a lot on dealing well with the memory bandwidth bound, I/O bound part of the sort where we write out the index itself, just by piggy-backing on tuplesort.c. I don't think that that's useful for a general-purpose executor node -- tuple-at-a-time processing when fetching from workers would kill performance.

> By looking at 'workersFinished' usage, it looks like you have devised
> a new way for the leader to know when workers have finished, which might be
> required for this patch. However, have you tried to use or investigate
> whether existing infrastructure which serves the same purpose could
> be used for it?

Yes, I have. I think that Robert's "condition variables" patch would offer a general solution to what I've devised. What I have there is, as you say, fairly ad-hoc, even though my requirements are actually fairly general. I was actually annoyed that there wasn't an easier way to do that myself. Robert has said that he won't commit his "condition variables" work until it's clear that there will be some use for the facility. Well, I'd use it for this patch, if I could. Robert?

--
Peter Geoghegan
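To sketch how the ad-hoc workersFinished handshake might map onto the proposed condition variables facility (the cv field, the helper test, and the wait event name are all assumptions, since that patch was unreleased at the time):

    /* Worker, on finishing its sort: announce completion, wake the leader. */
    SpinLockAcquire(&shared->mutex);
    shared->workersFinished++;
    SpinLockRelease(&shared->mutex);
    ConditionVariableBroadcast(&shared->cv);

    /* Leader: sleep until every launched worker has checked in. */
    ConditionVariablePrepareToSleep(&shared->cv);
    while (!all_workers_finished(shared, nlaunched))    /* hypothetical test */
        ConditionVariableSleep(&shared->cv, WAIT_EVENT_PARALLEL_SORT); /* assumed event name */
    ConditionVariableCancelSleep();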
On Mon, Oct 17, 2016 at 8:36 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> This project of mine is about parallelizing tuplesort.c, which isn't
>> really what you want for parallel query -- you shouldn't try to scope
>> the problem as "make the sort more scalable using parallelism" there.
>> Rather, you want to scope it at "make the execution of the entire
>> query more scalable using parallelism", which is really quite a
>> different thing, which necessarily involves the executor having direct
>> knowledge of partition boundaries.
>
> Okay, but what is the proof, or why do you think the second is going to
> be better than the first? One thing which strikes me as a major difference
> between your approach and Gather Merge is that in your approach the leader
> has to wait till all the workers are done with their work on sorting,
> whereas with Gather Merge, as soon as the first one is done, the leader
> starts with merging. I could be wrong here, but if I understood it
> correctly, then there is an argument that a Gather Merge kind of approach
> can win in cases where some of the workers can produce sorted outputs
> ahead of others, and I am not sure if we can dismiss such cases.

Gather Merge can't emit a tuple unless it has buffered at least one tuple from every producer; otherwise, the next tuple it receives from one of those producers might precede whichever tuple it chooses to emit. However, it doesn't need to wait until all of the workers are completely done. The leader only needs to be at least slightly ahead of the slowest worker. I'm not sure how that compares to Peter's approach.

What I'm worried about is that we're implementing two separate systems to do the same thing, and that the parallel sort approach is actually a lot less general. I think it's possible to imagine a Parallel Sort implementation which does things Gather Merge can't. If all of the workers collaborate to sort all of the data rather than each worker sorting its own data, then you've got something which Gather Merge can't match. But this is not that.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Oct 19, 2016 at 7:39 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> Gather Merge can't emit a tuple unless it has buffered at least one
> tuple from every producer; otherwise, the next tuple it receives from
> one of those producers might precede whichever tuple it chooses to
> emit. However, it doesn't need to wait until all of the workers are
> completely done. The leader only needs to be at least slightly ahead
> of the slowest worker. I'm not sure how that compares to Peter's
> approach.

I don't think that eager merging will prove all that effective, however it's implemented. I see a very I/O bound system when parallel CREATE INDEX merges serially. There is no obvious reason why you'd have a straggler worker process with CREATE INDEX, really.

> What I'm worried about is that we're implementing two separate systems
> to do the same thing, and that the parallel sort approach is actually
> a lot less general. I think it's possible to imagine a Parallel Sort
> implementation which does things Gather Merge can't. If all of the
> workers collaborate to sort all of the data rather than each worker
> sorting its own data, then you've got something which Gather Merge
> can't match. But this is not that.

It's not that yet, certainly. I think I've sketched a path forward for making partitioning a part of logtape.c that is promising. The sharing of ranges within tapes and so on will probably have a significant amount in common with what I've come up with.

I don't think that any executor infrastructure is a particularly good model when *batch output* is needed -- the tuple queue mechanism will be a significant bottleneck, particularly because it does not integrate read-ahead, etc.

The best case that I saw advertised for Gather Merge was TPC-H query 9 [1]. That doesn't look like a good proxy for how Gather Merge adapted to parallel CREATE INDEX would do, since it benefits from the GroupAggregate merge having many equal values, possibly with a clustering in the original tables that can naturally be exploited (no TID tiebreaker needed, since IndexTuples are not being merged). Also, it looks like Gather Merge may do that well by enabling things, rather than parallelizing the sort effectively per se. Besides, the query 9 case is significantly less scalable than good cases for this parallel CREATE INDEX patch have already been shown to be.

I think I've been pretty modest about what this parallel CREATE INDEX patch gets us from the beginning. It is a generalization of tuplesort.c to work in parallel; we need a lot more for that to make things like GroupAggregate as scalable as possible, and I don't pretend that this helps much with that. There are actually more changes to nbtsort.c to coordinate all of this than there are to tuplesort.c in the latest version, so I think that this simpler approach for parallel CREATE INDEX and CLUSTER is worthwhile.

The bottom line is that it's inherently difficult for me to refute the idea that Gather Merge could do just as well as what I have here, because proving that involves adding a significant amount of new infrastructure (e.g., to teach the executor about IndexTuples). I think that the argument for this basic approach is sound (it appears to offer comparable scalability to the parallel CREATE INDEX implementations of other systems), but it's simply impractical for me to offer much reassurance beyond that.

[1] https://github.com/tvondra/pg_tpch/blob/master/dss/templates/9.sql

--
Peter Geoghegan
On Thu, Oct 20, 2016 at 12:03 AM, Peter Geoghegan <pg@heroku.com> wrote:
> On Wed, Oct 19, 2016 at 7:39 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Gather Merge can't emit a tuple unless it has buffered at least one
>> tuple from every producer; otherwise, the next tuple it receives from
>> one of those producers might precede whichever tuple it chooses to
>> emit.

Right. Now, after again looking at the Gather Merge patch, I think I can better understand how it performs merging.

>> However, it doesn't need to wait until all of the workers are
>> completely done. The leader only needs to be at least slightly ahead
>> of the slowest worker. I'm not sure how that compares to Peter's
>> approach.
>
> I don't think that eager merging will prove all that effective,
> however it's implemented. I see a very I/O bound system when parallel
> CREATE INDEX merges serially. There is no obvious reason why you'd
> have a straggler worker process with CREATE INDEX, really.
>
>> What I'm worried about is that we're implementing two separate systems
>> to do the same thing, and that the parallel sort approach is actually
>> a lot less general. I think it's possible to imagine a Parallel Sort
>> implementation which does things Gather Merge can't. If all of the
>> workers collaborate to sort all of the data rather than each worker
>> sorting its own data, then you've got something which Gather Merge
>> can't match. But this is not that.
>
> It's not that yet, certainly. I think I've sketched a path forward for
> making partitioning a part of logtape.c that is promising. The sharing
> of ranges within tapes and so on will probably have a significant
> amount in common with what I've come up with.
>
> I don't think that any executor infrastructure is a particularly good
> model when *batch output* is needed -- the tuple queue mechanism will
> be a significant bottleneck, particularly because it does not
> integrate read-ahead, etc.

The tuple queue mechanism might not be super-efficient for *batch output* (cases where many tuples need to be read and written), but I see no reason why it will be slower than the disk I/O which I think you are using in the patch. IIUC, in the patch each worker including the leader does a tape sort for its share of tuples, and then finally the leader merges and populates the index. I am not sure if the mechanism used in the patch can be more useful as compared to using a tuple queue, if the workers can finish their part of sorting in-memory.

> The bottom line is that it's inherently difficult for me to refute the
> idea that Gather Merge could do just as well as what I have here,
> because proving that involves adding a significant amount of new
> infrastructure (e.g., to teach the executor about IndexTuples).

I think there could be a simpler way: like, we can force the Gather Merge node when all the tuples need to be sorted, and compute the time till it merges all tuples. Similarly, with your patch, we can wait till the final merge is completed. However, after doing an initial study of both the patches, I feel one can construct cases where Gather Merge can win, and also there will be cases where your patch can win. In particular, Gather Merge can win where workers need to perform the sort mostly in-memory. I am not sure if it's easy to get the best of both worlds.

Your patch needs a rebase, and I noticed one warning.
sort\logtape.c(1422): warning C4700: uninitialized local variable 'lt' used

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Tue, Oct 18, 2016 at 3:48 AM, Peter Geoghegan <pg@heroku.com> wrote:
> On Mon, Oct 17, 2016 at 5:36 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> I read the following paragraph from the Volcano paper just now:
>
> """
> During implementation and benchmarking of parallel sorting, we added
> two more features to exchange. First, we wanted to implement a merge
> network in which some processors produce sorted streams merge
> concurrently by other processors. Volcano's sort iterator can be used
> to generate a sorted stream. A merge iterator was easily derived from
> the sort module. It uses a single level merge, instead of the cascaded
> merge of runs used in sort. The input of a merge iterator is an
> exchange. Differently from other operators, the merge iterator
> requires to distinguish the input records by their producer. As an
> example, for a join operation it does not matter where the input
> records were created, and all inputs can be accumulated in a single
> input stream. For a merge operation, it is crucial to distinguish the
> input records by their producer in order to merge multiple sorted
> streams correctly.
> """
>
> I don't really understand this paragraph, but thought I'd ask: why the
> need to "distinguish the input records by their producer in order to
> merge multiple sorted streams correctly"? Isn't that talking about
> partitioning, where each worker's *ownership* of a range matters?

I think so, but it seems from the above text that it is mainly required for the merge iterator, which probably will be used in a merge join.

> My
> patch doesn't care which values belong to which workers. And, it
> focuses quite a lot on dealing well with the memory bandwidth bound,
> I/O bound part of the sort where we write out the index itself, just
> by piggy-backing on tuplesort.c. I don't think that that's useful for
> a general-purpose executor node -- tuple-at-a-time processing when
> fetching from workers would kill performance.

Right, but what is written in the text quoted by you seems to be do-able with tuple-at-a-time processing.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Fri, Oct 21, 2016 at 4:25 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Tue, Oct 18, 2016 at 3:48 AM, Peter Geoghegan <pg@heroku.com> wrote:
>> On Mon, Oct 17, 2016 at 5:36 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> I read the following paragraph from the Volcano paper just now:
>>
>> """
>> During implementation and benchmarking of parallel sorting, we added
>> two more features to exchange. First, we wanted to implement a merge
>> network in which some processors produce sorted streams merge
>> concurrently by other processors. Volcano's sort iterator can be used
>> to generate a sorted stream. A merge iterator was easily derived from
>> the sort module. It uses a single level merge, instead of the cascaded
>> merge of runs used in sort. The input of a merge iterator is an
>> exchange. Differently from other operators, the merge iterator
>> requires to distinguish the input records by their producer. As an
>> example, for a join operation it does not matter where the input
>> records were created, and all inputs can be accumulated in a single
>> input stream. For a merge operation, it is crucial to distinguish the
>> input records by their producer in order to merge multiple sorted
>> streams correctly.
>> """
>>
>> I don't really understand this paragraph, but thought I'd ask: why the
>> need to "distinguish the input records by their producer in order to
>> merge multiple sorted streams correctly"? Isn't that talking about
>> partitioning, where each worker's *ownership* of a range matters?
>
> I think so, but it seems from the above text that it is mainly required for
> the merge iterator, which probably will be used in a merge join.
>
>> My
>> patch doesn't care which values belong to which workers. And, it
>> focuses quite a lot on dealing well with the memory bandwidth bound,
>> I/O bound part of the sort where we write out the index itself, just
>> by piggy-backing on tuplesort.c. I don't think that that's useful for
>> a general-purpose executor node -- tuple-at-a-time processing when
>> fetching from workers would kill performance.
>
> Right, but what is written in the text quoted by you seems to be do-able
> with tuple-at-a-time processing.

To be clear, by saying the above, I don't mean that we should try that approach instead of what you are proposing, but it is worth some discussion to see if it has any significant merits.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Fri, Oct 7, 2016 at 5:47 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Work is still needed on:
>
> * Cost model. Should probably attempt to guess final index size, and
> derive calculation of number of workers from that. Also, I'm concerned
> that I haven't given enough thought to the low end, where with default
> settings most CREATE INDEX statements will use at least one parallel
> worker.
>
> * The whole way that I teach nbtsort.c to disallow catalog tables for
> parallel CREATE INDEX due to concerns about parallel safety is in need
> of expert review, preferably from Robert. It's complicated in a way
> that relies on things happening or not happening from a distance.
>
> * Heikki seems to want to change more about logtape.c, and its use of
> indirection blocks. That may evolve, but for now I can only target the
> master branch.
>
> * More extensive performance testing. I think that this V3 is probably
> the fastest version yet, what with Heikki's improvements, but I
> haven't really verified that.

While I haven't made progress on any of these open items, I should still get a version out that applies cleanly on top of git tip -- commit b75f467b6eec0678452fd8d7f8d306e6df3a1076 caused the patch to bitrot. I attach V4, which is a fairly mechanical rebase of V3, with no notable behavioral changes or bug fixes.

--
Peter Geoghegan
On Mon, Aug 1, 2016 at 3:18 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Setup:
>
> CREATE TABLE parallel_sort_test AS
> SELECT hashint8(i) randint,
> md5(i::text) collate "C" padding1,
> md5(i::text || '2') collate "C" padding2
> FROM generate_series(0, 1e9::bigint) i;
>
> CHECKPOINT;
>
> This leaves us with a parallel_sort_test table that is 94 GB in size.
>
> SET maintenance_work_mem = '8GB';
>
> -- Serial case (external sort, should closely match master branch):
> CREATE INDEX serial_idx ON parallel_sort_test (randint) WITH
> (parallel_workers = 0);
>
> Total time: 00:15:42.15
>
> -- Patch with 8 tuplesort "sort-and-scan" workers (leader process
> participates as a worker here):
> CREATE INDEX patch_8_idx ON parallel_sort_test (randint) WITH
> (parallel_workers = 7);
>
> Total time: 00:06:03.86
>
> As you can see, the parallel case is 2.58x faster

I decided to revisit this exact benchmark, using the same AWS instance type (the one with 16 HDDs, again configured in software RAID0) to see how things had changed for both parallel and serial cases. I am now testing V4. A lot changed in the last 3 months, with most of the changes that help here now already committed to the master branch.

Relevant changes
================

* Heikki's major overhaul of preload memory makes CREATE INDEX merging have more sequential access patterns. It also effectively allows us to use more memory. It's possible that the biggest benefit it brings to parallel CREATE INDEX is that it eliminates almost any random I/O penalty from logtape.c fragmentation that an extra merge pass has; parallel workers now usually do their own merge to produce one big run for the leader to merge. It also improves CPU cache efficiency quite directly, I think. This is the patch that helps most. Many thanks to Heikki for driving this forward.

* My patch to simplify and optimize how the K-way merge heap is maintained (as tuples fill leaf pages of the final index structure) makes the merge phase significantly less CPU bound overall.

(These first two items particularly help parallel CREATE INDEX, which spends proportionally much more wall clock time merging than would be expected for similar serial cases. Serial cases do of course also benefit.)

* V2 of the patch (and all subsequent versions) apportioned slices of maintenance_work_mem to workers. maintenance_work_mem became a per-utility-operation budget, regardless of number of workers launched. This means that workers have less memory than in the original V1 benchmark (they simply don't make use of it now), but this seems unlikely to hurt. Possibly, it even helps.

* Andres' work on md.c scalability may have helped (seems unlikely with these CREATE INDEX cases that produce indexes not in the hundreds of gigabytes, though). It would help with *extremely* large index creation, which we won't really look at here.

Things now look better than ever for the parallel CREATE INDEX patch. While it's typical for about 75% of wall clock time to be spent on sorting runs with serial CREATE INDEX, with the remaining 25% going on merging/writing the index, with parallel CREATE INDEX I now generally see about a 50/50 split between parallel sorting of runs (including any worker merging to produce final runs) and serial merging for the final on-the-fly merge, where we actually write the new index out as input is merged. This is a *significant* improvement over what we saw here back in August, where it was not uncommon for parallel CREATE INDEX to spend *twice* as much time in the serial final on-the-fly merge step. All improvements to the code that we've seen since August have targeted this final on-the-fly merge bottleneck. (The final on-the-fly merge is now *consistently* able to write out the index at a rate of 150MB/sec+ in my tests, which is pretty good.)

New results
===========

Same setup as the one quoted above -- once again, we "SET maintenance_work_mem = '8GB'".

-- Patch with 8 tuplesort "sort-and-scan" workers:
CREATE INDEX patch_8_idx ON parallel_sort_test (randint) WITH (parallel_workers = 7);

Total time: 00:04:24.93

-- Serial case:
CREATE INDEX serial_idx ON parallel_sort_test (randint) WITH (parallel_workers = 0);

Total time: 00:14:25.19

3.27x faster. Not bad. As you see in the quoted text, that was 2.58x back in August, even though the implementation now uses a lot less memory in parallel workers. And, that's without even considering the general question of how much faster index creation can be compared to Postgres 9.6 -- it's roughly 3.5x faster at times.

New case
========

Separately, using my gensort tool [1], I came up with a new test case. The tool generated a 2.5 billion row table, sized at 159GB. This is how long it takes to produce a 73GB index on the "sortkey" column of the resulting table:

-- gensort "C" locale text parallel case:
CREATE INDEX test8 on sort_test(sortkey) WITH (parallel_workers = 7);

Total time: 00:16:19.63

-- gensort "C" locale text serial case:
CREATE INDEX test0 on sort_test(sortkey) WITH (parallel_workers = 0);

Total time: 00:45:56.96

That's a 2.81x improvement in creation time relative to the serial case. Not quite as big a difference as seen in the first case, but remember that this is just like the cases that were only made something like 2x - 2.2x faster by the use of parallelism back in August (see the full e-mail quoted above [2]). These are cases involving a text column, or maybe a numeric column, that have complex comparators used during merging that must handle detoasting, possibly even allocate memory, etc. This second result is therefore probably the more significant of the two results shown, since it now seems like we're more consistently close to the ~3x improvement that other major database systems also seem to top out at as parallel CREATE INDEX workers are added.

(I still can't see any benefit with 16 workers; my guess is that the anti-scaling begins even before the merge starts when using that many workers. That guess is hard to verify, given the confounding factor of more workers producing more runs, leaving more work for the serial merge phase.)

I'd still welcome benchmarking or performance validation from somebody else.

[1] https://github.com/petergeoghegan/gensort
[2] https://www.postgresql.org/message-id/CAM3SWZQKM=Pzc=CAHzRixKjp2eO5Q0Jg1SoFQqeXFQ647JiwqQ@mail.gmail.com

--
Peter Geoghegan
On Wed, Oct 19, 2016 at 11:33 AM, Peter Geoghegan <pg@heroku.com> wrote:
> I don't think that eager merging will prove all that effective,
> however it's implemented. I see a very I/O bound system when parallel
> CREATE INDEX merges serially. There is no obvious reason why you'd
> have a straggler worker process with CREATE INDEX, really.

In an effort to head off any misunderstanding around this patch series, I started a new Wiki page for it:

https://wiki.postgresql.org/wiki/Parallel_External_Sort

This talks about parallel CREATE INDEX in particular, and uses of parallel external sort more generally, including future uses beyond CREATE INDEX. This approach worked very well for me during the UPSERT project, where a detailed overview really helped. With UPSERT, it was particularly difficult to keep the *current* state of things straight, such as current open items for the patch, areas of disagreement, and areas where there was no longer any disagreement or controversy.

I don't think that this patch is even remotely as complicated as UPSERT was, but it's still something that has had several concurrently active mailing list threads (threads that are at least loosely related to the project), so I think that this will be useful. I welcome anyone with an interest in this project to review the Wiki page, add their own concerns to it with -hackers citation, and add their own content around related work. There is a kind of unresolved question around where the Gather Merge work might fit in to what I've come up with already. There may be other unresolved questions like that, that I'm not even aware of.

I commit to maintaining the new Wiki page as a useful starting reference for understanding the current state of this patch. I hope this makes looking into the patch series less intimidating for potential reviewers.

--
Peter Geoghegan
On Mon, Oct 24, 2016 at 6:17 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> * Cost model. Should probably attempt to guess final index size, and
>> derive calculation of number of workers from that. Also, I'm concerned
>> that I haven't given enough thought to the low end, where with default
>> settings most CREATE INDEX statements will use at least one parallel
>> worker.
>
> While I haven't made progress on any of these open items, I should
> still get a version out that applies cleanly on top of git tip --
> commit b75f467b6eec0678452fd8d7f8d306e6df3a1076 caused the patch to
> bitrot. I attach V4, which is a fairly mechanical rebase of V3, with
> no notable behavioral changes or bug fixes.

I attach V5. Changes:

* A big cost model overhaul. Workers are logarithmically scaled based on projected final *index* size, not current heap size, as was the case in V4. A new nbtpage.c routine is added to estimate a not-yet-built B-Tree index's size, now called by the optimizer. This involves getting the average item width for indexed attributes from pg_attribute for the heap relation. There are some subtleties here with partial indexes, null_frac, etc. I also refined the cap applied on the number of workers that limits too many workers being launched when there isn't so much maintenance_work_mem.

The cost model is much improved now -- it is now more than just a placeholder, at least. It doesn't do things like launch a totally inappropriate number of workers to build a very small partial index. Granted, those workers would still have something to do -- scan the heap -- but not enough to justify launching so many (that is, launching as many as would be launched for an equivalent non-partial index). That having been said, things are still quite fudged here, and I struggle to find any guiding principle around doing better on average. I think that that's because of the inherent difficulty of modeling what's going on, but I'd be happy to be proven wrong on that. In any case, I think it's going to be fairly common for DBAs to want to use the storage parameter to force the use of a particular number of parallel workers. (See also: my remarks below on how the new bt_estimate_nblocks() SQL-callable function can give insight into the new cost model's decisions.)

* Overhauled leader_mergeruns() further, to make it closer to mergeruns(). We now always rewind input tapes. This simplification involved refining some of the assertions within logtape.c, which is also slightly simplified.

* 2 new testing tools are added to the final commit in the patch series (not actually proposed for commit). I've added 2 new SQL-callable functions to contrib/pageinspect. The 2 new testing functions are:

bt_estimate_nblocks
-------------------

bt_estimate_nblocks() provides an easy way to see the optimizer's projection of how large the final index will be. It returns an estimate in blocks.
Example:

mgd=# analyze;
ANALYZE
mgd=# select oid::regclass as rel, bt_estimated_nblocks(oid), relpages,
      to_char(bt_estimated_nblocks(oid)::numeric / relpages, 'FM990.990') as estimate_actual
      from pg_class where relkind = 'i'
      order by relpages desc limit 20;

                        rel                         │ bt_estimated_nblocks │ relpages │ estimate_actual
────────────────────────────────────────────────────┼──────────────────────┼──────────┼─────────────────
 mgd.acc_accession_idx_accid                        │              107,091 │  106,274 │ 1.008
 mgd.acc_accession_0                                │              169,024 │  106,274 │ 1.590
 mgd.acc_accession_1                                │              169,024 │   80,382 │ 2.103
 mgd.acc_accession_idx_prefixpart                   │               76,661 │   80,382 │ 0.954
 mgd.acc_accession_idx_mgitype_key                  │               76,661 │   76,928 │ 0.997
 mgd.acc_accession_idx_clustered                    │               76,661 │   76,928 │ 0.997
 mgd.acc_accession_idx_createdby_key                │               76,661 │   76,928 │ 0.997
 mgd.acc_accession_idx_numericpart                  │               76,661 │   76,928 │ 0.997
 mgd.acc_accession_idx_logicaldb_key                │               76,661 │   76,928 │ 0.997
 mgd.acc_accession_idx_modifiedby_key               │               76,661 │   76,928 │ 0.997
 mgd.acc_accession_pkey                             │               76,661 │   76,928 │ 0.997
 mgd.mgi_relationship_property_idx_propertyname_key │               74,197 │   74,462 │ 0.996
 mgd.mgi_relationship_property_idx_modifiedby_key   │               74,197 │   74,462 │ 0.996
 mgd.mgi_relationship_property_pkey                 │               74,197 │   74,462 │ 0.996
 mgd.mgi_relationship_property_idx_clustered        │               74,197 │   74,462 │ 0.996
 mgd.mgi_relationship_property_idx_createdby_key    │               74,197 │   74,462 │ 0.996
 mgd.seq_sequence_idx_clustered                     │               50,051 │   50,486 │ 0.991
 mgd.seq_sequence_raw_pkey                          │               35,826 │   35,952 │ 0.996
 mgd.seq_sequence_raw_idx_modifiedby_key            │               35,826 │   35,952 │ 0.996
 mgd.seq_source_assoc_idx_clustered                 │               35,822 │   35,952 │ 0.996
(20 rows)

I haven't tried to make the underlying logic as close to perfect as possible, but it tends to be accurate in practice, as is evident from this real-world example (this shows larger indexes following a restoration of the mouse genome sample database [1]).

Perhaps there could be a role for a refined bt_estimate_nblocks() function in determining when B-Tree indexes become bloated/unbalanced (maybe have pgstatindex() estimate index bloat based on a difference between projected and actual fan-in?). That has nothing to do with parallel CREATE INDEX, though.

bt_main_forks_identical
-----------------------

bt_main_forks_identical() checks if 2 specific relations have bitwise identical main forks. If they do, it returns the number of blocks in the main fork of each. Otherwise, an error is raised.

Unlike any approach involving *writing* the index in parallel (e.g., any worthwhile approach based on data partitioning), the proposed parallel CREATE INDEX implementation creates an identical index representation to that created by any serial process (including, for example, the master branch when CREATE INDEX uses an internal sort). The index that you end up with when parallelism is used ought to be 100% identical in all cases. (This is true because there is a TID tie-breaker when sorting B-Tree index tuples, and because LSNs are set to 0 by CREATE INDEX. Why not exploit that fact to test the implementation?)

If anyone can demonstrate that parallel CREATE INDEX fails to create a bitwise-identical index representation to a "known good" implementation, or can demonstrate that it doesn't consistently produce exactly the same final index representation given the same underlying table as input, then there *must* be a bug. bt_main_forks_identical() gives reviewers an easy way to verify this, perhaps just in passing during benchmarking.
pg_restore
==========

It occurs to me that parallel CREATE INDEX receives no special consideration by pg_restore. This leaves it so that the use of parallel CREATE INDEX can come down to whether or not pg_class.reltuples is accidentally updated by something like an initial CREATE INDEX. This is not ideal.

There is also the question of how pg_restore -j cases ought to give special consideration to parallel CREATE INDEX, if at all -- it's probably true that concurrent index builds on the same relation do go together well with parallel CREATE INDEX, but even in V5 pg_restore remains totally naive about this. That having been said, pg_restore currently does nothing clever with maintenance_work_mem when multiple jobs are used, even though that seems at least as useful as what I outline for parallel CREATE INDEX. It's not clear how to judge this.

What do we need to teach pg_restore about parallel CREATE INDEX, if anything at all? Could this be as simple as a blanket disabling of parallelism for CREATE INDEX from pg_restore? Or does it need to be more sophisticated than that? I suppose that tools like reindexdb and pgbench must be considered in a similar way.

Maybe we could get the number of blocks in the heap relation from the smgr when its pg_class.reltuples is 0, and then extrapolate reltuples using simple, generic logic, in the style of vac_estimate_reltuples() (its "old_rel_pages" == 0 case). For now, I've avoided doing that out of concern for the overhead in cases where there are many small tables to be restored, and because it may be better to err on the side of not using parallelism.

[1] https://wiki.postgresql.org/wiki/Sample_Databases
--
Peter Geoghegan
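As a rough illustration of the vac_estimate_reltuples()-style extrapolation mentioned just above: guess a tuple density from the average tuple width, then scale by the block count taken from the smgr. Every constant in this sketch is an assumption for illustration, not PostgreSQL's actual accounting:

/*
 * Sketch: extrapolate reltuples from the relation's block count when no
 * statistics exist at all (the "old_rel_pages == 0" style of estimate).
 */
static double
sketch_estimate_reltuples(long heap_blocks, int avg_tuple_width)
{
    const double usable_bytes_per_page = 8192 * 0.90;   /* assume ~10% page overhead */
    const int    per_tuple_overhead = 28;               /* assumed per-tuple header cost */
    double       tuples_per_page;

    if (heap_blocks <= 0)
        return 0.0;

    tuples_per_page = usable_bytes_per_page /
        (avg_tuple_width + per_tuple_overhead);

    return heap_blocks * tuples_per_page;
}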
On Mon, Nov 7, 2016 at 11:28 PM, Peter Geoghegan <pg@heroku.com> wrote:
> I attach V5.

I gather that 0001, which puts a cap on the number of tapes, is not actually related to the subject of this thread; it's an independent change that you think is a good idea. I reviewed the previous discussion on this topic upthread, between you and Heikki, which seems to me to contain more heat than light. At least in my opinion, the question is not whether a limit on the number of tapes is the best possible system, but rather whether it's better than the status quo. It's silly to refuse to make a simple change on the grounds that some much more complex change might be better, because if somebody writes that patch and it is better, we can always revert 0001 then. If 0001 involved hundreds of lines of invasive code changes, that argument wouldn't apply, but it doesn't; it's almost a one-liner.

Now, on the other hand, as far as I can see, the actual amount of evidence that 0001 is a good idea which has been presented in this forum is pretty near zero. You've argued for it on theoretical grounds several times, but theoretical arguments are not a substitute for test results. Therefore, I decided that the best thing to do was test it myself. I wrote a little patch to add a GUC for max_sort_tapes, which actually turns out not to work as I thought: setting max_sort_tapes = 501 seems to limit the highest tape number to 501 rather than the number of tapes to 501, so there's a sort of off-by-one error. But that doesn't really matter. The patch is attached here for the convenience of anyone else who may want to fiddle with this.

Next, I tried to set things up so that I'd get a large enough number of tapes for the cap to matter. To do that, I initialized with "pgbench -i --unlogged-tables -s 20000" so that I had 2 billion tuples. Then I used this SQL query: "select sum(w+abalance) from (select (aid::numeric * 7123000217)%1000000000 w, * from pgbench_accounts order by 1) x". The point of the math is to perturb the ordering of the tuples so that they actually need to be sorted instead of just passed through unchanged. The use of abalance in the outer sum prevents an index-only scan from being used, which makes the sort wider; perhaps I should have tried to make it wider still, but this is what I did. I wanted to have more than 501 tapes because, obviously, a concern with a change like this is that things might get slower in the case where it forces a polyphase merge rather than a single merge pass. And, of course, I set trace_sort = on.

Here's what my initial run looked like, in brief:

2016-11-09 15:37:52 UTC [44026] LOG: begin tuple sort: nkeys = 1, workMem = 262144, randomAccess = f
2016-11-09 15:37:59 UTC [44026] LOG: switching to external sort with 937 tapes: CPU: user: 5.51 s, system: 0.27 s, elapsed: 6.56 s
2016-11-09 16:48:31 UTC [44026] LOG: finished writing run 616 to tape 615: CPU: user: 4029.17 s, system: 152.72 s, elapsed: 4238.54 s
2016-11-09 16:48:31 UTC [44026] LOG: using 246719 KB of memory for read buffers among 616 input tapes
2016-11-09 16:48:39 UTC [44026] LOG: performsort done (except 616-way final merge): CPU: user: 4030.30 s, system: 152.98 s, elapsed: 4247.41 s
2016-11-09 18:33:30 UTC [44026] LOG: external sort ended, 6255145 disk blocks used: CPU: user: 10214.64 s, system: 175.24 s, elapsed: 10538.06 s

And according to psql: Time: 10538068.225 ms (02:55:38.068)

Then I set max_sort_tapes = 501 and ran it again.
This time:

2016-11-09 19:05:22 UTC [44026] LOG: begin tuple sort: nkeys = 1, workMem = 262144, randomAccess = f
2016-11-09 19:05:28 UTC [44026] LOG: switching to external sort with 502 tapes: CPU: user: 5.69 s, system: 0.26 s, elapsed: 6.13 s
2016-11-09 20:15:20 UTC [44026] LOG: finished writing run 577 to tape 75: CPU: user: 3993.81 s, system: 153.42 s, elapsed: 4198.52 s
2016-11-09 20:15:20 UTC [44026] LOG: using 249594 KB of memory for read buffers among 501 input tapes
2016-11-09 20:21:19 UTC [44026] LOG: finished 77-way merge step: CPU: user: 4329.50 s, system: 160.67 s, elapsed: 4557.22 s
2016-11-09 20:21:19 UTC [44026] LOG: performsort done (except 501-way final merge): CPU: user: 4329.50 s, system: 160.67 s, elapsed: 4557.22 s
2016-11-09 21:38:12 UTC [44026] LOG: external sort ended, 6255484 disk blocks used: CPU: user: 8848.81 s, system: 182.64 s, elapsed: 9170.62 s

And this one, according to psql: Time: 9170629.597 ms (02:32:50.630)

That looks very good. On a test that runs for almost 3 hours, we saved more than 20 minutes. The overall runtime improvement is 23% in a case where we would not expect this patch to do particularly well; after all, without limiting the number of tapes, we are able to complete the sort with a single merge pass, whereas when we reduce the number of tapes, we now require a polyphase merge. Nevertheless, we come out way ahead, because the final merge pass gets way faster, presumably because there are fewer tapes involved. The first test does a 616-way final merge and takes 6184.34 seconds to do it. The second test does a 501-way final merge and takes 4519.31 seconds to do it. This increased final merge speed accounts for practically all of the speedup, and the reason it's faster pretty much has to be that it's merging fewer tapes.

That, in turn, happens for two reasons. First, because limiting the number of tapes slightly increases the memory available for storing the tuples belonging to each run, we end up with fewer runs in the first place. The number of runs drops from 616 to 577, about a 7% reduction. Second, because we have more runs than tapes in the second case, it does a 77-way merge prior to the final merge. Because of that 77-way merge, the time at which the second test starts producing tuples is slightly later. Instead of producing the first tuple at 70:47.71, we have to wait until 75:72.22.

That's a small disadvantage in this case, because it's hypothetically possible that a query like this could have a LIMIT and we'd end up worse off overall. However, that's pretty unlikely, for three reasons. Number one, LIMIT isn't likely to be used on queries of this type in the first place. Number two, if it were used, we'd probably end up with a bounded sort plan which would be way faster anyway. Number three, if somehow we still sorted the data set, we'd still win in this case if the limit were more than about 20% of the total number of tuples. Possibly needing to wait a little longer for the first tuple is a small price to pay for the much faster run time to produce the whole data set.

Admittedly, this is only one test, and some other test might show a different result. However, I believe that there aren't likely to be many losing cases. If the reduced number of tapes doesn't force a polyphase merge, we're almost certain to win, because in that case the only thing that changes is that we have more memory with which to produce each run. On small sorts, this may not help much, but it won't hurt.
Even if the reduced number of tapes *does* force a polyphase merge, the reduction in the number of initial runs and/or the reduction in the number of runs in any single merge may add up to a win, as in this example. In fact, it may well be the case that the optimal number of tapes is significantly less than 501. It's hard to tell for sure, but it sure looks like that 77-way non-final merge is significantly more efficient than the final merge.

So, I'm now feeling pretty bullish about this patch, except for one thing, which is that I think the comments are way off-base. Peter writes: "When allowedMem is significantly lower than what is required for an internal sort, it is unlikely that there are benefits to increasing the number of tapes beyond Knuth's 'sweet spot' of 7." I'm pretty sure that's totally wrong, first of all because commit df700e6b40195d28dc764e0c694ac8cef90d4638 improved performance by doing precisely the thing which this comment says we shouldn't, secondly because 501 is most definitely significantly higher than 7, so the code and the comment don't even match, and thirdly because, as the comment added in that commit says, each extra tape doesn't really cost that much. In this example, going from 501 tapes up to 937 tapes only reduces the memory available for tuples by about 7%, even though the number of tapes has almost doubled. If we had a sort with, say, 30 runs, do we really want to do a polyphase merge just to get a sub-1% increase in the amount of memory per run? I doubt it.

Given all that, what I'm inclined to do is rewrite the comment to say, basically, that even though we can afford lots of tapes, it's better not to allow ridiculously many, because (1) that eats away at the amount of memory available for tuples in each initial run and (2) very high-order final merges are not very efficient. And then commit that. If somebody wants to fine-tune the tape limit later after more extensive testing, or replace it with some other system that is better, great.

Sound OK?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
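To see why the 7% figure above comes out the way it does, here is a back-of-the-envelope sketch. The fixed per-tape reservation below is inferred from the numbers quoted in this thread, not taken from tuplesort.c's actual accounting:

/*
 * Sketch: each tape reserves a fixed slice of workMem up front, so a few
 * extra tapes are cheap individually but costly in bulk.
 */
#include <stdio.h>

int
main(void)
{
    long work_mem_kb = 262144;      /* 256MB, as in the test above */
    long per_tape_kb = 42;          /* assumed fixed reservation per tape */
    int  tape_counts[] = {7, 101, 501, 937};

    for (int i = 0; i < 4; i++)
    {
        long reserved = (long) tape_counts[i] * per_tape_kb;
        printf("%4d tapes: %6ld KB reserved, %5.1f%% of workMem left for tuples\n",
               tape_counts[i], reserved,
               100.0 * (work_mem_kb - reserved) / work_mem_kb);
    }
    return 0;   /* 501 -> ~92% left, 937 -> ~85%: the ~7% gap quoted above */
}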
On Wed, Nov 9, 2016 at 4:01 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I gather that 0001, which puts a cap on the number of tapes, is not actually related to the subject of this thread; it's an independent change that you think is a good idea. I reviewed the previous discussion on this topic upthread, between you and Heikki, which seems to me to contain more heat than light.

FWIW, I don't remember it that way. Heikki seemed to be uncomfortable with the quasi-arbitrary choice of constant, rather than disagreeing with the general idea of a cap. Or maybe he thought I didn't go far enough -- that is, that polyphase merge should be removed completely. I think that removing polyphase merge would be an orthogonal change to this one, though.

> Now, on the other hand, as far as I can see, the actual amount of evidence that 0001 is a good idea which has been presented in this forum is pretty near zero. You've argued for it on theoretical grounds several times, but theoretical arguments are not a substitute for test results.

See the illustration in TAOCP, vol III, page 273 in the second edition -- "Fig. 70. Efficiency of Polyphase merge using Algorithm D". I think that it's actually a real-world benchmark. I guess I felt that no one ever argued that using as many tapes as possible was sound on any grounds, even theoretical, and so didn't feel obligated to test it until asked to do so. I think that the reason that a cap like this didn't go in around the time that the growth logic went in (2006) was that nobody followed up on it. If you look at the archives, there is plenty of discussion of a cap like this at the time.

> That looks very good. On a test that runs for almost 3 hours, we saved more than 20 minutes. The overall runtime improvement is 23% in a case where we would not expect this patch to do particularly well; after all, without limiting the number of tapes, we are able to complete the sort with a single merge pass, whereas when we reduce the number of tapes, we now require a polyphase merge. Nevertheless, we come out way ahead, because the final merge pass gets way faster, presumably because there are fewer tapes involved. The first test does a 616-way final merge and takes 6184.34 seconds to do it. The second test does a 501-way final merge and takes 4519.31 seconds to do it. This increased final merge speed accounts for practically all of the speedup, and the reason it's faster pretty much has to be that it's merging fewer tapes.
>
> That, in turn, happens for two reasons. First, because limiting the number of tapes slightly increases the memory available for storing the tuples belonging to each run, we end up with fewer runs in the first place. The number of runs drops from 616 to 577, about a 7% reduction. Second, because we have more runs than tapes in the second case, it does a 77-way merge prior to the final merge. Because of that 77-way merge, the time at which the second test starts producing tuples is slightly later. Instead of producing the first tuple at 70:47.71, we have to wait until 75:72.22. That's a small disadvantage in this case, because it's hypothetically possible that a query like this could have a LIMIT and we'd end up worse off overall. However, that's pretty unlikely, for three reasons. Number one, LIMIT isn't likely to be used on queries of this type in the first place. Number two, if it were used, we'd probably end up with a bounded sort plan which would be way faster anyway.
> Number three, if somehow we still sorted the data set, we'd still win in this case if the limit were more than about 20% of the total number of tuples. Possibly needing to wait a little longer for the first tuple is a small price to pay for the much faster run time to produce the whole data set.

Cool.

> So, I'm now feeling pretty bullish about this patch, except for one thing, which is that I think the comments are way off-base. Peter writes: "When allowedMem is significantly lower than what is required for an internal sort, it is unlikely that there are benefits to increasing the number of tapes beyond Knuth's 'sweet spot' of 7." I'm pretty sure that's totally wrong, first of all because commit df700e6b40195d28dc764e0c694ac8cef90d4638 improved performance by doing precisely the thing which this comment says we shouldn't

It's more complicated than that. As I said, I think that Knuth basically had it right with his sweet spot of 7. I think that commit df700e6b40195d28dc764e0c694ac8cef90d4638 was effective in large part because a one-pass merge avoided certain overheads not inherent to polyphase merge, like all that memory accounting stuff, extra palloc() traffic, etc. The expanded use of per-tape buffering that we have even in multi-pass cases likely makes that much less true for us these days.

The reason I haven't actually gone right back down to 7 with this cap is that it's possible that the added I/O costs outweigh the CPU costs in extreme cases, even though I think that polyphase merge doesn't have all that much to do with I/O costs, even with its 1970s perspective. Knuth doesn't say much about I/O costs -- it's more about using an extremely small amount of memory effectively (minimizing CPU costs with very little available main memory).

Furthermore, not limiting ourselves to 7 tapes and seeing a benefit (benefitting from a few dozen or a few hundred instead) seems more possible with the improved merge heap maintenance logic added recently, where there could be perhaps hundreds of runs merged with very low CPU cost in the event of presorted input (or input that is inversely logically/physically correlated). That would be true because we'd only examine the top of the heap throughout, and so I/O costs may matter much more.

Depending on the exact details, I bet you could see a benefit with only 7 tapes due to CPU cache efficiency in a case like the one you describe. Perhaps when sorting integers, but not when sorting collated text. There are many competing considerations, which I've tried my best to balance here with a merge order of 500.

> Sound OK?

I'm fine with not mentioning Knuth's sweet spot once more. I guess it's not of much practical value that he was on to something with that. I realize, on reflection, that my understanding of what's going on is very nuanced. Thanks

--
Peter Geoghegan
On Wed, Nov 9, 2016 at 4:54 PM, Peter Geoghegan <pg@heroku.com> wrote:
> It's more complicated than that. As I said, I think that Knuth basically had it right with his sweet spot of 7. I think that commit df700e6b40195d28dc764e0c694ac8cef90d4638 was effective in large part because a one-pass merge avoided certain overheads not inherent to polyphase merge, like all that memory accounting stuff, extra palloc() traffic, etc. The expanded use of per-tape buffering that we have even in multi-pass cases likely makes that much less true for us these days.

Also, logtape.c fragmentation made multiple merge pass cases experience increased random I/O in a way that was only an accident of our implementation. We've fixed that now, but that problem must have added further cost that df700e6b40195d28dc764e0c694ac8cef90d4638 *masked* when it was committed in 2006. (I do think that the problem with the merge heap maintenance fixed recently in 24598337c8d214ba8dcf354130b72c49636bba69 was the biggest problem that the 2006 work masked, though.)

--
Peter Geoghegan
On Wed, Nov 9, 2016 at 7:54 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> Now, on the other hand, as far as I can see, the actual amount of evidence that 0001 is a good idea which has been presented in this forum is pretty near zero. You've argued for it on theoretical grounds several times, but theoretical arguments are not a substitute for test results.
>
> See the illustration in TAOCP, vol III, page 273 in the second edition -- "Fig. 70. Efficiency of Polyphase merge using Algorithm D". I think that it's actually a real-world benchmark.

I don't have that publication, and I'm guessing that's not based on PostgreSQL's implementation. There's no substitute for tests using the code we've actually got.

>> So, I'm now feeling pretty bullish about this patch, except for one thing, which is that I think the comments are way off-base. Peter writes: "When allowedMem is significantly lower than what is required for an internal sort, it is unlikely that there are benefits to increasing the number of tapes beyond Knuth's 'sweet spot' of 7." I'm pretty sure that's totally wrong, first of all because commit df700e6b40195d28dc764e0c694ac8cef90d4638 improved performance by doing precisely the thing which this comment says we shouldn't
>
> It's more complicated than that. As I said, I think that Knuth basically had it right with his sweet spot of 7. I think that commit df700e6b40195d28dc764e0c694ac8cef90d4638 was effective in large part because a one-pass merge avoided certain overheads not inherent to polyphase merge, like all that memory accounting stuff, extra palloc() traffic, etc. The expanded use of per-tape buffering that we have even in multi-pass cases likely makes that much less true for us these days.
>
> The reason I haven't actually gone right back down to 7 with this cap is that it's possible that the added I/O costs outweigh the CPU costs in extreme cases, even though I think that polyphase merge doesn't have all that much to do with I/O costs, even with its 1970s perspective. Knuth doesn't say much about I/O costs -- it's more about using an extremely small amount of memory effectively (minimizing CPU costs with very little available main memory).
>
> Furthermore, not limiting ourselves to 7 tapes and seeing a benefit (benefitting from a few dozen or a few hundred instead) seems more possible with the improved merge heap maintenance logic added recently, where there could be perhaps hundreds of runs merged with very low CPU cost in the event of presorted input (or input that is inversely logically/physically correlated). That would be true because we'd only examine the top of the heap throughout, and so I/O costs may matter much more.
>
> Depending on the exact details, I bet you could see a benefit with only 7 tapes due to CPU cache efficiency in a case like the one you describe. Perhaps when sorting integers, but not when sorting collated text. There are many competing considerations, which I've tried my best to balance here with a merge order of 500.

I guess that's possible, but the problem with polyphase merge is that the increased I/O becomes a pretty significant cost in a hurry.
Here's the same test with max_sort_tapes = 100:

2016-11-09 23:02:49 UTC [48551] LOG: begin tuple sort: nkeys = 1, workMem = 262144, randomAccess = f
2016-11-09 23:02:55 UTC [48551] LOG: switching to external sort with 101 tapes: CPU: user: 5.72 s, system: 0.25 s, elapsed: 6.04 s
2016-11-10 00:13:00 UTC [48551] LOG: finished writing run 544 to tape 49: CPU: user: 4003.00 s, system: 156.89 s, elapsed: 4211.33 s
2016-11-10 00:16:52 UTC [48551] LOG: finished 51-way merge step: CPU: user: 4214.84 s, system: 161.94 s, elapsed: 4442.98 s
2016-11-10 00:25:41 UTC [48551] LOG: finished 100-way merge step: CPU: user: 4704.14 s, system: 170.83 s, elapsed: 4972.47 s
2016-11-10 00:36:47 UTC [48551] LOG: finished 99-way merge step: CPU: user: 5333.12 s, system: 179.94 s, elapsed: 5638.52 s
2016-11-10 00:45:32 UTC [48551] LOG: finished 99-way merge step: CPU: user: 5821.13 s, system: 189.00 s, elapsed: 6163.53 s
2016-11-10 01:01:29 UTC [48551] LOG: finished 100-way merge step: CPU: user: 6691.10 s, system: 210.60 s, elapsed: 7120.58 s
2016-11-10 01:01:29 UTC [48551] LOG: performsort done (except 100-way final merge): CPU: user: 6691.10 s, system: 210.60 s, elapsed: 7120.58 s
2016-11-10 01:45:40 UTC [48551] LOG: external sort ended, 6255949 disk blocks used: CPU: user: 9271.07 s, system: 232.26 s, elapsed: 9771.49 s

This is already worse than max_sort_tapes = 501, though the total runtime is still better than with no cap (the time to first tuple is way worse, though). I'm going to try max_sort_tapes = 10 next, but I think the basic pattern is already fairly clear. As you reduce the cap on the number of tapes, (a) the time to build the initial runs doesn't change very much, (b) the time to perform the final merge decreases significantly, and (c) the time to perform the non-final merges increases even faster.

In this particular test configuration on this particular hardware, rewriting 77 tapes in the 501-tape configuration wasn't too bad, but now that we're down to 100 tapes, we have to rewrite 449 tapes out of a total of 544, and that's actually a loss: rewriting the bulk of your data an extra time to save on cache misses doesn't pay. It would probably be even less good if there were other concurrent activity on the system. It's possible that if your polyphase merge is actually being done all in memory, cache efficiency might remain the dominant consideration, but I think we should assume that a polyphase merge is doing actual I/O, because it's sort of pointless to use that algorithm in the first place if there's no real I/O involved.

At the moment, at least, it looks to me as though we don't need to be afraid of a *little* bit of polyphase merging, but a *lot* of polyphase merging is actually pretty bad. In other words, by imposing a limit on the number of tapes, we're going to improve sorts that are smaller than work_mem * num_tapes * ~1.5 -- because cache efficiency will be better -- but above that things will probably get worse because of the increased I/O cost. From that point of view, a 500-tape limit is the same as saying that we don't think it's entirely reasonable to try to perform a sort that exceeds work_mem by a factor of more than ~750, whereas a 7-tape limit is the same as saying that we don't think it's entirely reasonable to perform a sort that exceeds work_mem by a factor of more than ~10. That latter proposition seems entirely untenable. Our default work_mem setting is 4MB, and people will certainly expect to be able to get away with, say, an 80MB sort without changing settings.
On the other hand, if they're sorting more than 3GB with work_mem = 4MB, I think we'll be justified in making a gentle suggestion that they reconsider that setting. Among other arguments, it's going to be pretty slow in that case no matter what we do here.

Maybe another way of putting this is that, while there's clearly a benefit to having some kind of a cap, it's appropriate to pick a large value, such as 500. Having no cap at all risks creating many extra tapes that just waste memory, and also risks an unduly cache-inefficient final merge. Reining that in makes sense. However, we can't rein it in too far or we'll create slow polyphase merges in cases that are reasonably likely to occur in real life.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
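The ~750 and ~10 figures above follow from simple arithmetic. Here it is as a sketch; the 1.5x initial-run-size factor is the rough figure used above, not a measured constant:

/*
 * Sketch: a single merge pass can cover roughly (tape cap) runs, and each
 * initial run is roughly 1.5x work_mem, so the cap bounds how large a sort
 * can be completed in one pass.
 */
#include <stdio.h>

int
main(void)
{
    double run_size_factor = 1.5;   /* each initial run ~1.5x work_mem */
    int    caps[] = {7, 501};

    for (int i = 0; i < 2; i++)
        printf("cap of %3d tapes => one merge pass covers up to ~%.1fx work_mem\n",
               caps[i], caps[i] * run_size_factor);
    /* with work_mem = 4MB, a 7-tape cap tops out near 40MB; 501 near 3GB */
    return 0;
}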
On Wed, Nov 9, 2016 at 6:57 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I guess that's possible, but the problem with polyphase merge is that the increased I/O becomes a pretty significant cost in a hurry.

Not if you have a huge RAID array. :-)

Obviously I'm not seriously suggesting that we revise the cap from 500 to 7. We're only concerned about the constant factors here. There is clearly a need to make some simplifying assumptions. I think that you understand this very well, though.

> Maybe another way of putting this is that, while there's clearly a benefit to having some kind of a cap, it's appropriate to pick a large value, such as 500. Having no cap at all risks creating many extra tapes that just waste memory, and also risks an unduly cache-inefficient final merge. Reining that in makes sense. However, we can't rein it in too far or we'll create slow polyphase merges in cases that are reasonably likely to occur in real life.

I completely agree with your analysis.

--
Peter Geoghegan
On Wed, Nov 9, 2016 at 10:18 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> Maybe another way of putting this is that, while there's clearly a benefit to having some kind of a cap, it's appropriate to pick a large value, such as 500. Having no cap at all risks creating many extra tapes that just waste memory, and also risks an unduly cache-inefficient final merge. Reining that in makes sense. However, we can't rein it in too far or we'll create slow polyphase merges in cases that are reasonably likely to occur in real life.
>
> I completely agree with your analysis.

Cool. BTW, my run with 10 tapes completed in 10696528.377 ms (02:58:16.528) -- i.e. almost 3 minutes slower than with no tape limit. Building runs took 4260.16 s, and the final merge pass began at 8239.12 s. That's certainly better than I expected, and it seems to show that even if the number of tapes is grossly inadequate for the number of runs, you can still make up most of the time that you lose to I/O with improved cache efficiency -- at least under favorable circumstances. Of course, on many systems I/O bandwidth will be a scarce resource, so that argument can be overdone -- and even if not, the 10-tape sort takes FAR longer to deliver the first tuple.

I also tried this out with work_mem = 512MB. Doubling work_mem reduces the number of runs enough that we don't get a polyphase merge in any case. With no limit on tapes:

2016-11-10 11:24:45 UTC [54042] LOG: switching to external sort with 1873 tapes: CPU: user: 11.34 s, system: 0.48 s, elapsed: 12.13 s
2016-11-10 12:36:22 UTC [54042] LOG: finished writing run 308 to tape 307: CPU: user: 4096.63 s, system: 156.88 s, elapsed: 4309.66 s
2016-11-10 12:36:22 UTC [54042] LOG: using 516563 KB of memory for read buffers among 308 input tapes
2016-11-10 12:36:30 UTC [54042] LOG: performsort done (except 308-way final merge): CPU: user: 4097.75 s, system: 157.24 s, elapsed: 4317.76 s
2016-11-10 13:54:07 UTC [54042] LOG: external sort ended, 6255577 disk blocks used: CPU: user: 8638.72 s, system: 177.42 s, elapsed: 8974.44 s

With max_sort_tapes = 501:

2016-11-10 14:23:50 UTC [54042] LOG: switching to external sort with 502 tapes: CPU: user: 10.99 s, system: 0.54 s, elapsed: 11.57 s
2016-11-10 15:36:47 UTC [54042] LOG: finished writing run 278 to tape 277: CPU: user: 4190.31 s, system: 155.33 s, elapsed: 4388.86 s
2016-11-10 15:36:47 UTC [54042] LOG: using 517313 KB of memory for read buffers among 278 input tapes
2016-11-10 15:36:54 UTC [54042] LOG: performsort done (except 278-way final merge): CPU: user: 4191.36 s, system: 155.68 s, elapsed: 4395.66 s
2016-11-10 16:53:39 UTC [54042] LOG: external sort ended, 6255699 disk blocks used: CPU: user: 8673.07 s, system: 175.93 s, elapsed: 9000.80 s

0.3% slower with the tape limit, but that might be noise. Even if not, it seems pretty silly to create 1873 tapes when we only need ~300.
At work_mem = 2GB:

2016-11-10 18:08:00 UTC [54042] LOG: switching to external sort with 7490 tapes: CPU: user: 44.28 s, system: 1.99 s, elapsed: 46.33 s
2016-11-10 19:23:06 UTC [54042] LOG: finished writing run 77 to tape 76: CPU: user: 4342.10 s, system: 156.21 s, elapsed: 4551.95 s
2016-11-10 19:23:06 UTC [54042] LOG: using 2095202 KB of memory for read buffers among 77 input tapes
2016-11-10 19:23:12 UTC [54042] LOG: performsort done (except 77-way final merge): CPU: user: 4343.36 s, system: 157.07 s, elapsed: 4558.79 s
2016-11-10 20:24:24 UTC [54042] LOG: external sort ended, 6255946 disk blocks used: CPU: user: 7894.71 s, system: 176.36 s, elapsed: 8230.13 s

At work_mem = 2GB, max_sort_tapes = 501:

2016-11-10 21:28:23 UTC [54042] LOG: switching to external sort with 502 tapes: CPU: user: 44.09 s, system: 1.94 s, elapsed: 46.07 s
2016-11-10 22:42:28 UTC [54042] LOG: finished writing run 68 to tape 67: CPU: user: 4278.49 s, system: 154.39 s, elapsed: 4490.25 s
2016-11-10 22:42:28 UTC [54042] LOG: using 2095427 KB of memory for read buffers among 68 input tapes
2016-11-10 22:42:34 UTC [54042] LOG: performsort done (except 68-way final merge): CPU: user: 4279.60 s, system: 155.21 s, elapsed: 4496.83 s
2016-11-10 23:42:10 UTC [54042] LOG: external sort ended, 6255983 disk blocks used: CPU: user: 7733.98 s, system: 173.85 s, elapsed: 8072.55 s

Roughly 2% faster. Maybe still noise, but less likely. 7490 tapes certainly seems over the top.

At work_mem = 8GB:

2016-11-14 19:17:28 UTC [54042] LOG: switching to external sort with 29960 tapes: CPU: user: 183.80 s, system: 7.71 s, elapsed: 191.61 s
2016-11-14 20:32:02 UTC [54042] LOG: finished writing run 20 to tape 19: CPU: user: 4431.44 s, system: 176.82 s, elapsed: 4665.16 s
2016-11-14 20:32:02 UTC [54042] LOG: using 8388083 KB of memory for read buffers among 20 input tapes
2016-11-14 20:32:26 UTC [54042] LOG: performsort done (except 20-way final merge): CPU: user: 4432.99 s, system: 181.29 s, elapsed: 4689.52 s
2016-11-14 21:30:56 UTC [54042] LOG: external sort ended, 6256003 disk blocks used: CPU: user: 7835.83 s, system: 199.01 s, elapsed: 8199.29 s

At work_mem = 8GB, max_sort_tapes = 501:

2016-11-14 21:52:43 UTC [54042] LOG: switching to external sort with 502 tapes: CPU: user: 181.08 s, system: 7.66 s, elapsed: 189.05 s
2016-11-14 23:06:06 UTC [54042] LOG: finished writing run 17 to tape 16: CPU: user: 4381.56 s, system: 161.82 s, elapsed: 4591.63 s
2016-11-14 23:06:06 UTC [54042] LOG: using 8388158 KB of memory for read buffers among 17 input tapes
2016-11-14 23:06:36 UTC [54042] LOG: performsort done (except 17-way final merge): CPU: user: 4383.45 s, system: 165.32 s, elapsed: 4622.04 s
2016-11-14 23:54:00 UTC [54042] LOG: external sort ended, 6256002 disk blocks used: CPU: user: 7124.49 s, system: 182.16 s, elapsed: 7466.18 s

Roughly 9% faster. The time to build runs seems to degrade very slowly as we increase work_mem, but the final merge is speeding up somewhat more quickly. Intuitively that makes sense to me: if merging were faster than quicksorting, we could just merge-sort all the time instead of using quicksort for internal sorts. Also, we've got 29960 tapes now, better than three orders of magnitude more than what we actually need. At this work_mem setting, 501 tapes is enough to efficiently sort at least 4TB of data and quite possibly a good bit more.

So, committed 0001, with comment changes along the lines I proposed before.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Nov 7, 2016 at 8:28 PM, Peter Geoghegan <pg@heroku.com> wrote:
> What do we need to teach pg_restore about parallel CREATE INDEX, if anything at all? Could this be as simple as a blanket disabling of parallelism for CREATE INDEX from pg_restore? Or, does it need to be more sophisticated than that? I suppose that tools like reindexdb and pgbench must be considered in a similar way.

I still haven't resolved this question, which seems like the most important outstanding question, but I attach V6. Changes:

* tuplesort.c was adapted to use the recently committed condition variables stuff. This made things cleaner. No more ad-hoc WaitLatch() looping.

* Adapted docs to mention the newly committed max_parallel_workers GUC in the context of discussing the proposed max_parallel_workers_maintenance GUC.

* Fixed a trivial assertion failure bug that could be tripped when a conventional sort uses very little memory.

--
Peter Geoghegan
Peter Geoghegan wrote:
> On Mon, Nov 7, 2016 at 8:28 PM, Peter Geoghegan <pg@heroku.com> wrote:
> > What do we need to teach pg_restore about parallel CREATE INDEX, if anything at all? Could this be as simple as a blanket disabling of parallelism for CREATE INDEX from pg_restore? Or, does it need to be more sophisticated than that? I suppose that tools like reindexdb and pgbench must be considered in a similar way.
>
> I still haven't resolved this question, which seems like the most important outstanding question,

I don't think a patch must necessarily consider all possible uses that the new feature may have. If we introduce parallel index creation, that's great; if pg_restore doesn't start using it right away, that's okay. You, or somebody else, can still patch it later. The patch is still a step forward.

--
Álvaro Herrera
https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sat, Dec 3, 2016 at 5:45 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> I don't think a patch must necessarily consider all possible uses that the new feature may have. If we introduce parallel index creation, that's great; if pg_restore doesn't start using it right away, that's okay. You, or somebody else, can still patch it later. The patch is still a step forward.

While I agree, right now pg_restore will tend to use or not use parallelism for CREATE INDEX more or less by accident, based on whether or not pg_class.reltuples has already been set by something else (e.g., an earlier CREATE INDEX against the same table in the restoration). That seems unacceptable. I haven't just suppressed the use of parallel CREATE INDEX within pg_restore because that would be taking a position on something I have a hard time defending any particular position on. And so, I am slightly concerned about the entire ecosystem of tools that could implicitly use parallel CREATE INDEX, with undesirable consequences. Especially pg_restore.

It's not so much a hard question as it is an awkward one. I want to handle any possible objection about there being future compatibility issues with going one way or the other ("This paints us into a corner with..."). And there is no existing, simple way for pg_restore and other tools to disable the use of parallelism due to the cost model automatically kicking in, while still allowing the proposed new index storage parameter ("parallel_workers") to force the use of parallelism, which seems like something that should happen. (I might have to add a new GUC like "enable_maintenance_parallelism", since "max_parallel_workers_maintenance = 0" disables parallelism no matter how it might be invoked.)

In general, I have a positive outlook on this patch, since it appears to compete well with similar implementations in other systems scalability-wise. It does what it's supposed to do.

--
Peter Geoghegan
On Sat, 2016-12-03 at 18:37 -0800, Peter Geoghegan wrote:
> On Sat, Dec 3, 2016 at 5:45 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> > I don't think a patch must necessarily consider all possible uses that the new feature may have. If we introduce parallel index creation, that's great; if pg_restore doesn't start using it right away, that's okay. You, or somebody else, can still patch it later. The patch is still a step forward.
>
> While I agree, right now pg_restore will tend to use or not use parallelism for CREATE INDEX more or less by accident, based on whether or not pg_class.reltuples has already been set by something else (e.g., an earlier CREATE INDEX against the same table in the restoration). That seems unacceptable. I haven't just suppressed the use of parallel CREATE INDEX within pg_restore because that would be taking a position on something I have a hard time defending any particular position on. And so, I am slightly concerned about the entire ecosystem of tools that could implicitly use parallel CREATE INDEX, with undesirable consequences. Especially pg_restore.
>
> It's not so much a hard question as it is an awkward one. I want to handle any possible objection about there being future compatibility issues with going one way or the other ("This paints us into a corner with..."). And there is no existing, simple way for pg_restore and other tools to disable the use of parallelism due to the cost model automatically kicking in, while still allowing the proposed new index storage parameter ("parallel_workers") to force the use of parallelism, which seems like something that should happen. (I might have to add a new GUC like "enable_maintenance_parallelism", since "max_parallel_workers_maintenance = 0" disables parallelism no matter how it might be invoked.)

I do share your concerns about unpredictable behavior - that's particularly worrying for pg_restore, which may be used for time-sensitive use cases (DR, migrations between versions), so unpredictable changes in behavior / duration are unwelcome.

But isn't this more a deficiency in pg_restore than in CREATE INDEX? The issue seems to be that the reltuples value may or may not get updated, so maybe forcing ANALYZE (even very low statistics_target values would do the trick, I think) would be a more appropriate solution? Or maybe it's time to add at least some rudimentary statistics into the dumps (the reltuples field seems like a good candidate). Trying to fix this by adding more GUCs seems a bit strange to me.

> In general, I have a positive outlook on this patch, since it appears to compete well with similar implementations in other systems scalability-wise. It does what it's supposed to do.

+1 to that

--
Tomas Vondra
http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sat, Dec 3, 2016 at 7:23 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> I do share your concerns about unpredictable behavior - that's particularly worrying for pg_restore, which may be used for time-sensitive use cases (DR, migrations between versions), so unpredictable changes in behavior / duration are unwelcome.

Right.

> But isn't this more a deficiency in pg_restore than in CREATE INDEX? The issue seems to be that the reltuples value may or may not get updated, so maybe forcing ANALYZE (even very low statistics_target values would do the trick, I think) would be a more appropriate solution? Or maybe it's time to add at least some rudimentary statistics into the dumps (the reltuples field seems like a good candidate).

I think that there are a number of reasonable ways of looking at it. It might also be worthwhile to have a minimal ANALYZE performed by CREATE INDEX directly, iff there are no preexisting statistics (there is definitely going to be something pg_restore-like that we cannot fix -- some ETL tool, for example). Perhaps, as an additional condition for proceeding with such an ANALYZE, it should also only happen when there is any chance at all of parallelism being used (but then you get into having to establish the relation size reliably in the absence of any pg_class.relpages, which isn't very appealing when there are many tiny indexes).

In summary, I would really like it if a consensus emerged on how parallel CREATE INDEX should handle the ecosystem of tools like pg_restore, reindexdb, and so on. Personally, I'm neutral on which general approach should be taken. Proposals from other hackers about what to do here are particularly welcome.

--
Peter Geoghegan
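To make the conditional pre-ANALYZE idea from the message above concrete, here is a hypothetical fragment; quick_sample_analyze() and min_parallel_index_blocks are invented stand-ins, and nothing like this exists in the patch:

/*
 * Hypothetical sketch only: run a cheap, low-target ANALYZE before an
 * index build when no statistics exist and parallelism is plausible.
 * Only RelationGetNumberOfBlocks() and reltuples are real names here.
 */
static void
maybe_analyze_before_index_build(Relation heapRel)
{
    /* statistics already exist; nothing to do */
    if (heapRel->rd_rel->reltuples != 0)
        return;

    /* skip tiny relations, so restoring many small tables stays cheap */
    if (RelationGetNumberOfBlocks(heapRel) < min_parallel_index_blocks)
        return;

    /* hypothetical: ANALYZE with a very low statistics target */
    quick_sample_analyze(heapRel);
}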
On Mon, Dec 5, 2016 at 7:44 AM, Peter Geoghegan <pg@heroku.com> wrote:
> On Sat, Dec 3, 2016 at 7:23 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
> > I do share your concerns about unpredictable behavior - that's
> > particularly worrying for pg_restore, which may be used for time-
> > sensitive use cases (DR, migrations between versions), so unpredictable
> > changes in behavior / duration are unwelcome.
> Right.
> > But isn't this more a deficiency in pg_restore than in CREATE INDEX?
> > The issue seems to be that the reltuples value may or may not get
> > updated, so maybe forcing ANALYZE (even very low statistics_target
> > values would do the trick, I think) would be a more appropriate solution?
> > Or maybe it's time to add at least some rudimentary statistics into the
> > dumps (the reltuples field seems like a good candidate).
> I think that there are a number of reasonable ways of looking at it. It
> might also be worthwhile to have a minimal ANALYZE performed by CREATE
> INDEX directly, iff there are no preexisting statistics (there is
> definitely going to be something pg_restore-like that we cannot fix --
> some ETL tool, for example). Perhaps, as an additional condition for
> proceeding with such an ANALYZE, it should also only happen when there
> is any chance at all of parallelism being used (but then you get into
> having to establish the relation size reliably in the absence of any
> pg_class.relpages, which isn't very appealing when there are many tiny
> indexes).
> In summary, I would really like it if a consensus emerged on how
> parallel CREATE INDEX should handle the ecosystem of tools like
> pg_restore, reindexdb, and so on. Personally, I'm neutral on which
> general approach should be taken. Proposals from other hackers about
> what to do here are particularly welcome.
Moved to next CF with "needs review" status.
Regards,
Hari Babu
Fujitsu Australia
On Wed, Sep 21, 2016 at 12:52 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> I find this unification business really complicated. I think it'd be simpler to keep the BufFiles and LogicalTapeSets separate, and instead teach tuplesort.c how to merge tapes that live on different LogicalTapeSets/BufFiles. Or refactor LogicalTapeSet so that a single LogicalTapeSet can contain tapes from different underlying BufFiles.
>
> What I have in mind is something like the attached patch. It refactors LogicalTapeRead(), LogicalTapeWrite() etc. functions to take a LogicalTape as argument, instead of LogicalTapeSet and tape number. LogicalTapeSet doesn't have the concept of a tape number anymore, it can contain any number of tapes, and you can create more on the fly. With that, it'd be fairly easy to make tuplesort.c merge LogicalTapes that came from different tape sets, backed by different BufFiles. I think that'd avoid much of the unification code.

I just looked at the buffile.c/buffile.h changes in the latest version of the patch and I agree with this criticism, though maybe not with the proposed solution. I actually don't understand what "unification" is supposed to mean. The patch really doesn't explain that anywhere that I can see. It says stuff like:

+ * Parallel operations can use an interface to unify multiple worker-owned
+ * BufFiles and a leader-owned BufFile within a leader process.  This relies
+ * on various fd.c conventions about the naming of temporary files.

That comment tells you that unification is a thing you can do -- via an unspecified interface for unspecified reasons using unspecified conventions -- but it doesn't tell you what the semantics of it are supposed to be. For example, if we "unify" several BufFiles, do they then have a shared seek pointer? Do the existing contents effectively get concatenated in an unpredictable order, or are they all expected to be empty at the time unification happens? Or something else? It's fine to make up new words -- indeed, in some sense that is the essence of writing about any complex problem -- but you have to define them. As far as I can tell, the idea is that we're somehow magically concatenating the BufFiles into one big super-BufFile, but I'm fuzzy on exactly what's supposed to be going on there.

It's hard to understand how something like this doesn't leak resources. Maybe that's been thought about here, but it isn't very clear to me how it's supposed to work. In Heikki's proposal, if process A is trying to read a file owned by process B, and process B dies and removes the file before process A gets around to reading it, we have got trouble, especially on Windows, which apparently has low tolerance for such things. Peter's proposal avoids that - I *think* - by making the leader responsible for all resource cleanup, but that's inferior to the design we've used for other sorts of shared resource cleanup (DSM, DSA, shm_mq, lock groups) where the last process to detach always takes responsibility. That avoids assuming that we're always dealing with a leader-follower situation, it doesn't categorically require the leader to be the one who creates the shared resource, and it doesn't require the leader to be the last process to die.

Imagine a data structure that is stored in dynamic shared memory and contains space for a filename, a reference count, and a mutex. Let's call this thing a SharedTemporaryFile or something like that.
It offers these APIs:

extern void SharedTemporaryFileInitialize(SharedTemporaryFile *);
extern void SharedTemporaryFileAttach(SharedTemporaryFile *, dsm_segment *seg);
extern void SharedTemporaryFileAssign(SharedTemporaryFile *, char *pathname);
extern File SharedTemporaryFileGetFile(SharedTemporaryFile *);

After setting aside sizeof(SharedTemporaryFile) bytes in your shared DSM segment, you call SharedTemporaryFileInitialize() to initialize them. Then, every process that cares about the file does SharedTemporaryFileAttach(), which bumps the reference count and sets an on_dsm_detach hook to decrement the reference count and unlink the file if the reference count thereby reaches 0. One of those processes does SharedTemporaryFileAssign(), which fills in the pathname and clears FD_TEMPORARY. Then, any process that has attached can call SharedTemporaryFileGetFile() to get a File which can then be accessed normally.

So, the pattern for parallel sort would be:

- Leader sets aside space and calls SharedTemporaryFileInitialize() and SharedTemporaryFileAttach().
- The cooperating worker calls SharedTemporaryFileAttach() and then SharedTemporaryFileAssign().
- The leader then calls SharedTemporaryFileGetFile().

Since the leader can attach to the file before the path name is filled in, there's no window where the file is at risk of being leaked. Before SharedTemporaryFileAssign(), the worker is solely responsible for removing the file; after that call, whichever of the leader and the worker exits last will remove the file.

> That leaves one problem, though: reusing space in the final merge phase. If the tapes being merged belong to different LogicalTapeSets, and you create one new tape to hold the result, the new tape cannot easily reuse the space of the input tapes because they are on different tape sets.

If the worker is always completely finished with the tape before the leader touches it, couldn't the leader's LogicalTapeSet just "adopt" the tape and overwrite it like any other?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
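The reference-counted cleanup described above can be sketched in ordinary C, with a pthread mutex and unlink() standing in for the PostgreSQL primitives (DSM, on_dsm_detach, LWLocks). This is only an illustration of the last-one-out-unlinks logic, not a proposed implementation:

/*
 * Sketch: a shared structure holding a refcount and a pathname; the last
 * detaching process removes the file, if it was ever assigned a name.
 */
#include <pthread.h>
#include <string.h>
#include <unistd.h>

typedef struct SharedTemporaryFile
{
    pthread_mutex_t mutex;
    int             refcnt;
    char            pathname[1024];  /* empty until "assigned" */
} SharedTemporaryFile;

void
SharedTemporaryFileInitialize(SharedTemporaryFile *stf)
{
    pthread_mutex_init(&stf->mutex, NULL);
    stf->refcnt = 0;
    stf->pathname[0] = '\0';
}

/* each interested process attaches, bumping the count */
void
SharedTemporaryFileAttach(SharedTemporaryFile *stf)
{
    pthread_mutex_lock(&stf->mutex);
    stf->refcnt++;
    pthread_mutex_unlock(&stf->mutex);
    /* the real thing would also register a detach hook calling ...Detach() */
}

/* the creating process fills in the name once the file exists */
void
SharedTemporaryFileAssign(SharedTemporaryFile *stf, const char *pathname)
{
    pthread_mutex_lock(&stf->mutex);
    strncpy(stf->pathname, pathname, sizeof(stf->pathname) - 1);
    pthread_mutex_unlock(&stf->mutex);
}

/* detach hook: the last process out removes the file */
void
SharedTemporaryFileDetach(SharedTemporaryFile *stf)
{
    int remove_it;

    pthread_mutex_lock(&stf->mutex);
    remove_it = (--stf->refcnt == 0 && stf->pathname[0] != '\0');
    pthread_mutex_unlock(&stf->mutex);
    if (remove_it)
        unlink(stf->pathname);
}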
On Tue, Dec 20, 2016 at 2:53 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> What I have in mind is something like the attached patch. It refactors LogicalTapeRead(), LogicalTapeWrite() etc. functions to take a LogicalTape as argument, instead of LogicalTapeSet and tape number. LogicalTapeSet doesn't have the concept of a tape number anymore, it can contain any number of tapes, and you can create more on the fly. With that, it'd be fairly easy to make tuplesort.c merge LogicalTapes that came from different tape sets, backed by different BufFiles. I think that'd avoid much of the unification code.
>
> I just looked at the buffile.c/buffile.h changes in the latest version of the patch and I agree with this criticism, though maybe not with the proposed solution. I actually don't understand what "unification" is supposed to mean. The patch really doesn't explain that anywhere that I can see. It says stuff like:
>
> + * Parallel operations can use an interface to unify multiple worker-owned
> + * BufFiles and a leader-owned BufFile within a leader process.  This relies
> + * on various fd.c conventions about the naming of temporary files.

Without meaning to sound glib, unification is the process by which parallel CREATE INDEX has the leader read temp files from workers sufficient to complete its final on-the-fly merge. So, it's a term that's a bit like "speculative insertion" was up until UPSERT was committed: a concept that is somewhat in flux, and that describes a new low-level mechanism built to support a higher-level operation, which must accord with a higher-level set of requirements (so, for speculative insertion, that would be avoiding "unprincipled deadlocks" and so on). That being the case, maybe "unification" isn't useful as a precise piece of terminology at this point, but that will change.

While I'm fairly confident that I basically have the right idea with this patch, I think that you are better at judging the ins and outs of resource management than I am, not least because of the experience of working on parallel query itself. Also, I'm signed up to review parallel hash join in large part because I think there might be some convergence concerning the sharing of BufFiles among parallel workers. I don't think I'm qualified to judge what a general abstraction like this should look like, but I'm trying to get there.

> That comment tells you that unification is a thing you can do -- via an unspecified interface for unspecified reasons using unspecified conventions -- but it doesn't tell you what the semantics of it are supposed to be. For example, if we "unify" several BufFiles, do they then have a shared seek pointer?

No.

> Do the existing contents effectively get concatenated in an unpredictable order, or are they all expected to be empty at the time unification happens? Or something else?

The order is the same order in which ordinal identifiers are assigned to workers within tuplesort.c, which is undefined, with the notable exception of the leader's own identifier (-1) and area of the unified BufFile space (this is only relevant in randomAccess cases, where the leader may write stuff out to its own reserved part of the BufFile space). It only matters that the bit of metadata in shared memory is in that same order, which it clearly will be. So, it's unpredictable, but in the same way that ordinal identifiers are assigned in a not-well-defined order; it doesn't, or at least shouldn't, matter.
We can imagine a case where it does matter, and we probably should, but that case isn't parallel CREATE INDEX.

> It's fine to make up new words -- indeed, in some sense that is the essence of writing about any complex problem -- but you have to define them.

I invite you to help me define this new word.

> It's hard to understand how something like this doesn't leak resources. Maybe that's been thought about here, but it isn't very clear to me how it's supposed to work.

I agree that it would be useful to centrally document what all this unification stuff is about. Suggestions on where that should live are welcome.

> In Heikki's proposal, if process A is trying to read a file owned by process B, and process B dies and removes the file before process A gets around to reading it, we have got trouble, especially on Windows, which apparently has low tolerance for such things. Peter's proposal avoids that - I *think* - by making the leader responsible for all resource cleanup, but that's inferior to the design we've used for other sorts of shared resource cleanup (DSM, DSA, shm_mq, lock groups) where the last process to detach always takes responsibility.

Maybe it's inferior to that, but I think what Heikki proposes is more or less complementary to what I've proposed, and has nothing to do with resource management and plenty to do with making the logtape.c interface look nice, AFAICT. It's also about refactoring/simplifying logtape.c itself, while we're at it. I believe that Heikki has yet to comment either way on my approach to resource management, one aspect of the patch that I was particularly keen on your looking into.

The theory of operation here is that workers own their own BufFiles, and are responsible for deleting them when they die. The assumption, rightly or wrongly, is that it's sufficient that workers flush everything out (write out temp files), and yield control to the leader, which will open their temp files for the duration of the leader's final on-the-fly merge. The resource manager in the leader knows it isn't supposed to ever delete worker-owned files (just close() the FDs), and the leader errors if it cannot find temp files that match what it expects. If there is an error in the leader, it shuts down workers, and they clean up, more than likely. If there is an error in the worker, or if the files cannot be deleted (e.g., if there is a classic hard crash scenario), we should also be okay, because nobody will trip up on some old temp file from some worker, since fd.c has some gumption about what workers need to do (and what the leader needs to avoid) in the event of a hard crash. I don't see a risk of file descriptor leaks, which may or may not have been part of your concern (please clarify).

> That avoids assuming that we're always dealing with a leader-follower situation, it doesn't categorically require the leader to be the one who creates the shared resource, and it doesn't require the leader to be the last process to die.

I have an open mind about that, especially given the fact that I hope to generalize the unification stuff further, but I am not aware of any reason why that is strictly necessary.

> Imagine a data structure that is stored in dynamic shared memory and contains space for a filename, a reference count, and a mutex. Let's call this thing a SharedTemporaryFile or something like that.
> It offers these APIs:
>
> extern void SharedTemporaryFileInitialize(SharedTemporaryFile *);
> extern void SharedTemporaryFileAttach(SharedTemporaryFile *, dsm_segment *seg);
> extern void SharedTemporaryFileAssign(SharedTemporaryFile *, char *pathname);
> extern File SharedTemporaryFileGetFile(SharedTemporaryFile *);

I'm a little bit tired right now, and I have yet to look at Thomas' parallel hash join patch in any detail. I'm interested in what you have to say here, but I think that I need to learn more about its requirements in order to have an informed opinion.

>> That leaves one problem, though: reusing space in the final merge phase. If the tapes being merged belong to different LogicalTapeSets, and you create one new tape to hold the result, the new tape cannot easily reuse the space of the input tapes because they are on different tape sets.
>
> If the worker is always completely finished with the tape before the leader touches it, couldn't the leader's LogicalTapeSet just "adopt" the tape and overwrite it like any other?

I'll remind you that parallel CREATE INDEX doesn't actually ever need to be randomAccess, and so we are not actually going to ever need to do this as things stand. I wrote the code that way in order to not break the existing interface, which seemed like a blocker to posting the patch. I am open to the idea of such an "adoption" occurring, even though it actually wouldn't help any case that exists in the patch as proposed. I didn't go that far in part because it seemed premature, given that nobody had looked at my work to date at the time, and given the fact that there'd be no initial user-visible benefit, and given how the exact meaning of "unification" was (and is) somewhat in flux.

I see no good reason not to do that, although that might change if I actually seriously undertook to teach the leader about this kind of "adoption". I suspect that the interface specification would make for confusing reading, which isn't terribly appealing, but I'm sure I could manage to make it work given time.

--
Peter Geoghegan
On Tue, Dec 20, 2016 at 8:14 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Without meaning to sound glib, unification is the process by which
> parallel CREATE INDEX has the leader read temp files from workers
> sufficient to complete its final on-the-fly merge.

That's not glib, but you can't in the end define BufFile unification in terms of what parallel CREATE INDEX needs. Whatever changes we make to lower-level abstractions in the service of some higher-level goal need to be explainable on their own terms.

>> It's fine to make up new words -- indeed, in some sense that is the essence
>> of writing any complex problem -- but you have to define them.
>
> I invite you to help me define this new word.

If at some point I'm able to understand what it means, I'll try to do that. I think you're loosely using "unification" to mean combining stuff from different backends in some way that depends on the particular context, so that "BufFile unification" can be different from "LogicalTape unification". But that's just punting the question of what each of those things actually are.

> Maybe it's inferior to that, but I think what Heikki proposes is more
> or less complementary to what I've proposed, and has nothing to do
> with resource management and plenty to do with making the logtape.c
> interface look nice, AFAICT. It's also about refactoring/simplifying
> logtape.c itself, while we're at it. I believe that Heikki has yet to
> comment either way on my approach to resource management, one aspect
> of the patch that I was particularly keen on your looking into.

My reading of Heikki's point was that there's not much point in touching the BufFile level of things if we can do all of the necessary stuff at the LogicalTape level, and I agree with him about that. If a shared BufFile had a shared read-write pointer, that would be a good justification for having it. But it seems like unification at the BufFile level is just concatenation, and that can be done just as well at the LogicalTape level, so why tinker with BufFile? As I've said, I think there's some low-level hacking needed here to make sure files get removed at the correct time in all cases, but apart from that I see no good reason to push the concatenation operation all the way down into BufFile.

> The theory of operation here is that workers own their own BufFiles,
> and are responsible for deleting them when they die. The assumption,
> rightly or wrongly, is that it's sufficient that workers flush
> everything out (write out temp files), and yield control to the
> leader, which will open their temp files for the duration of the
> leader's final on-the-fly merge. The resource manager in the leader
> knows it isn't supposed to ever delete worker-owned files (just
> close() the FDs), and the leader errors if it cannot find temp files
> that match what it expects. If there is an error in the leader, it
> shuts down workers, and they clean up, more than likely. If there is
> an error in the worker, or if the files cannot be deleted (e.g., if
> there is a classic hard crash scenario), we should also be okay,
> because nobody will trip up on some old temp file from some worker,
> since fd.c has some gumption about what workers need to do (and what
> the leader needs to avoid) in the event of a hard crash. I don't see a
> risk of file descriptor leaks, which may or may not have been part of
> your concern (please clarify).
I don't think there's any new issue with file descriptor leaks here, but I think there is a risk of calling unlink() too early or too late with your design. My proposal was an effort to nail that down real tight. >> If the worker is always completely finished with the tape before the >> leader touches it, couldn't the leader's LogicalTapeSet just "adopt" >> the tape and overwrite it like any other? > > I'll remind you that parallel CREATE INDEX doesn't actually ever need > to be randomAccess, and so we are not actually going to ever need to > do this as things stand. I wrote the code that way in order to not > break the existing interface, which seemed like a blocker to posting > the patch. I am open to the idea of such an "adoption" occurring, even > though it actually wouldn't help any case that exists in the patch as > proposed. I didn't go that far in part because it seemed premature, > given that nobody had looked at my work to date at the time, and given > the fact that there'd be no initial user-visible benefit, and given > how the exact meaning of "unification" was (and is) somewhat in flux. > > I see no good reason to not do that, although that might change if I > actually seriously undertook to teach the leader about this kind of > "adoption". I suspect that the interface specification would make for > confusing reading, which isn't terribly appealing, but I'm sure I > could manage to make it work given time. I think the interface is pretty clear: the worker's logical tapes get incorporated into the leader's LogicalTapeSet as if they'd been there all along. After all, by the time this is happening, IIUC (please confirm), the worker is done with those tapes and will never read or modify them again. If that's right, the worker just needs a way to identify those tapes to the leader, which can then add them to its LogicalTapeSet. That's it. It needs a way to identify them, but I think that shouldn't be hard; in fact, I think your patch has something like that already. And it needs to make sure that the files get removed at the right time, but I already sketched a solution to that problem. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 12/21/2016 12:53 AM, Robert Haas wrote:
>> That leaves one problem, though: reusing space in the final merge phase. If
>> the tapes being merged belong to different LogicalTapeSets, and create one
>> new tape to hold the result, the new tape cannot easily reuse the space of
>> the input tapes because they are on different tape sets.
>
> If the worker is always completely finished with the tape before the
> leader touches it, couldn't the leader's LogicalTapeSet just "adopt"
> the tape and overwrite it like any other?

Currently, the logical tape code assumes that all tapes in a single LogicalTapeSet are allocated from the same BufFile. The logical tape's on-disk format contains block numbers, to point to the next/prev block of the tape [1], and they're assumed to refer to the same file. That allows reusing space efficiently during the merge. After you have read the first block from tapes A, B and C, you can immediately reuse those three blocks for output tape D.

Now, if you read multiple tapes from different LogicalTapeSets, hence backed by different BufFiles, you cannot reuse the space from those different tapes for a single output tape, because the on-disk format doesn't allow referring to blocks in other files. You could reuse the space of *one* of the input tapes, by placing the output tape in the same LogicalTapeSet, but not all of them.

We could enhance that, by using "filename + block number" instead of just block number, in the pointers in the logical tapes. Then you could spread one logical tape across multiple files. Probably not worth it in practice, though.

[1] As the code stands, there are no next/prev pointers, but a tree of "indirect" blocks. But I'm planning to change that to simpler next/prev pointers, in https://www.postgresql.org/message-id/flat/55b3b7ae-8dec-b188-b8eb-e07604052351%40iki.fi

- Heikki
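For illustration, the widened pointer Heikki mentions might look like this; the struct and field names are invented for the example, not taken from any patch:

/*
 * Today a tape's next/prev pointers are plain block numbers, implicitly
 * relative to the tape set's single BufFile.  Widening them like this
 * would let one logical tape span files belonging to several tape sets,
 * at the cost of bigger on-disk pointers.
 */
typedef struct TapeBlockPointer
{
	int		fileno;		/* proxy for the backing file's name */
	long	blkno;		/* block number within that file */
} TapeBlockPointer;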
On Wed, Dec 21, 2016 at 7:04 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> If the worker is always completely finished with the tape before the
>> leader touches it, couldn't the leader's LogicalTapeSet just "adopt"
>> the tape and overwrite it like any other?
>
> Currently, the logical tape code assumes that all tapes in a single
> LogicalTapeSet are allocated from the same BufFile. The logical tape's
> on-disk format contains block numbers, to point to the next/prev block of
> the tape [1], and they're assumed to refer to the same file. That allows
> reusing space efficiently during the merge. After you have read the first
> block from tapes A, B and C, you can immediately reuse those three blocks
> for output tape D.

I see. Hmm.

> Now, if you read multiple tapes from different LogicalTapeSets, hence backed
> by different BufFiles, you cannot reuse the space from those different tapes
> for a single output tape, because the on-disk format doesn't allow referring
> to blocks in other files. You could reuse the space of *one* of the input
> tapes, by placing the output tape in the same LogicalTapeSet, but not all of
> them.
>
> We could enhance that, by using "filename + block number" instead of just
> block number, in the pointers in the logical tapes. Then you could spread
> one logical tape across multiple files. Probably not worth it in practice,
> though.

OK, so the options as I understand them are:

1. Enhance the logical tape set infrastructure in the manner you mention, to support filename (or more likely a proxy for filename) + block number in the logical tape pointers. Then, tapes can be transferred from one LogicalTapeSet to another.

2. Enhance the BufFile infrastructure to support some notion of a shared BufFile so that multiple processes can be reading and writing blocks in the same BufFile. Then, extend the logical tape infrastructure so that we also have the notion of a shared LogicalTape. This means that things like ltsGetFreeBlock() need to be re-engineered to handle concurrency with other backends.

3. Just live with the waste of space.

I would guess that (1) is easier than (2). Also, (2) might provoke contention while writing tapes that is otherwise completely unnecessary. It seems silly to have multiple backends fighting over the same end-of-file pointer for the same file when they could just write to different files instead.

Another tangentially-related problem I just realized is that we need to somehow handle the issues that tqueue.c does when transferring tuples between backends -- most of the time there's no problem, but if anonymous record types are involved then tuples require "remapping". It's probably harder to provoke a failure in the tuplesort case than with parallel query per se, but it's probably not impossible.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Dec 21, 2016 at 6:00 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> 3. Just live with the waste of space.

I am loath to create a special case for the parallel interface too, but I think it's possible that *no* caller will ever actually need to live with this restriction at any time in the future. I am strongly convinced that adapting tuplesort.c for parallelism should involve partitioning [1]. With that approach, even randomAccess callers will not want to read at random from one big materialized tape, since that's at odds with the whole point of partitioning, which is to remove any dependencies between workers quickly and early, so that as much work as possible is pushed down into workers. If a merge join were performed in a world where we have this kind of partitioning, we definitely wouldn't require one big materialized tape that is accessible within each worker. What are the chances of any real user actually having to live with the waste of space at some point in the future?

> Another tangentially-related problem I just realized is that we need
> to somehow handle the issues that tqueue.c does when transferring
> tuples between backends -- most of the time there's no problem, but if
> anonymous record types are involved then tuples require "remapping".
> It's probably harder to provoke a failure in the tuplesort case than
> with parallel query per se, but it's probably not impossible.

Thanks for pointing that out. I'll look into it.

BTW, I discovered a bug: when very little memory is available within each worker, tuplesort.c throws an error in the workers immediately. It's just a matter of making sure that they at least have 64KB of workMem, which is a pretty straightforward fix. Obviously it makes no sense to use so little memory in the first place; this is a corner case.

[1] https://www.postgresql.org/message-id/CAM3SWZR+ATYAzyMT+hm-Bo=1L1smtJbNDtibwBTKtYqS0dYZVg@mail.gmail.com

--
Peter Geoghegan
On Wed, Dec 21, 2016 at 10:21 AM, Peter Geoghegan <pg@heroku.com> wrote:
> On Wed, Dec 21, 2016 at 6:00 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> 3. Just live with the waste of space.
>
> I am loath to create a special case for the parallel interface too,
> but I think it's possible that *no* caller will ever actually need to
> live with this restriction at any time in the future.

I just realized that you were actually talking about the waste of space in workers here, as opposed to the theoretical waste of space that would occur in the leader should there ever be a parallel randomAccess tuplesort caller. To be clear, I am totally against allowing a waste of logtape.c temp file space in *workers*, because that implies a cost that will most certainly be felt by users all the time.

--
Peter Geoghegan
On Tue, Dec 20, 2016 at 5:14 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> Imagine a data structure that is stored in dynamic shared memory and
>> contains space for a filename, a reference count, and a mutex. Let's
>> call this thing a SharedTemporaryFile or something like that. It
>> offers these APIs:
>>
>> extern void SharedTemporaryFileInitialize(SharedTemporaryFile *);
>> extern void SharedTemporaryFileAttach(SharedTemporaryFile *, dsm_segment *seg);
>> extern void SharedTemporaryFileAssign(SharedTemporaryFile *, char *pathname);
>> extern File SharedTemporaryFileGetFile(SharedTemporaryFile *);
>
> I'm a little bit tired right now, and I have yet to look at Thomas'
> parallel hash join patch in any detail. I'm interested in what you
> have to say here, but I think that I need to learn more about its
> requirements in order to have an informed opinion.

Attached is V7 of the patch. The overall emphasis with this revision is on bringing clarity on how much can be accomplished using generalized infrastructure, explaining the unification mechanism coherently, and related issues.

Notable changes
---------------

* Rebased to work with the newly simplified logtape.c representation (the recent removal of "indirect blocks" by Heikki). Heikki's work was something that helped with simplifying the whole unification mechanism, to a significant degree. I think that there was over a 50% reduction in logtape.c lines of code in this revision.

* randomAccess cases are now able to reclaim disk space from blocks originally written by workers. This further simplifies logtape.c changes significantly. I don't think that this is important because some future randomAccess caller might otherwise have double the storage overhead for their parallel sort, or even because of the disproportionate performance penalty such a caller would experience; rather, it's important because it removes previous special cases (that were internal to logtape.c). For example, aside from the fact that worker tapes within a unified tapeset will often have a non-zero offset, there is no state that actually remembers that this is a unified tapeset, because that isn't needed anymore. And, even though we reclaim blocks from workers, we only have one central chokepoint for applying worker offsets in the leader (that chokepoint is ltsReadFillBuffer()). Routines tasked with things like positional seeking for mark/restore for certain tuplesort clients (which are, in general, poorly tested) now need to have no knowledge of unification while still working just the same. This is a consequence of the fact that ltsWriteBlock() callers (and ltsWriteBlock() itself) never have to think about offsets. I'm pretty happy about that.

* pg_restore now prevents the planner from deciding that parallelism should be used, in order to make restoration behavior more consistent and predictable. Iff a dump being restored happens to have a CREATE INDEX with the new index storage parameter parallel_workers set, then pg_restore will use parallel CREATE INDEX. This is accomplished with a new GUC, enable_parallelddl (since "max_parallel_workers_maintenance = 0" will disable parallel CREATE INDEX across the board, ISTM that a second new GUC is required). I think that this behavior is the right trade-off as far as pg_restore goes, although I still don't feel particularly strongly about it. There is now a concrete proposal on what to do about pg_restore, if nothing else.
To recap, the general concern addressed here is that there are typically no ANALYZE stats available for the planner to base a decision on when pg_restore runs CREATE INDEX, although that isn't always true, which was both surprising and inconsistent.

* Addresses the problem of anonymous record types and their need for "remapping" across parallel workers. I've simply pushed the responsibility onto callers as part of the tuplesort.h contract; parallel CREATE INDEX callers don't need to care about this, as explained there. (CLUSTER tuplesorts would also be safe.)

* Puts the whole rationale for unification into one large comment above the function BufFileUnify(), and removes traces of the same kind of discussion from everywhere else. I think that buffile.c is the right central place to discuss the unification mechanism, now that logtape.c has been greatly simplified. All the fd.c changes are in routines that are only ever called by buffile.c anyway, and are not too complicated (in general, temp fd.c files are only ever owned transitively, through BufFiles). So, morally, the unification mechanism is something that wholly belongs to buffile.c, since unification is all about temp files, and buffile.h is the interface through which temp files are owned and accessed in general, without exception.

Unification remains specialized
-------------------------------

On the one hand, BufFileUnify() now describes the whole idea of unification in detail, in its own general terms, including its performance characteristics, but on the other hand it doesn't pretend to be more general than it is (that's why we really have to talk about performance characteristics). It doesn't go as far as admitting to being the thing that logtape.c uses for parallel sort, but even that doesn't seem totally unreasonable to me. I think that BufFileUnify() might also end up being used by tuplestore.c, so it isn't entirely non-general, but I now realize that it's unlikely to be used by parallel hash join. So, while randomAccess reclamation of worker blocks within the leader now occurs, I have not followed Robert's suggestion in full. For example, I didn't do this: "ltsGetFreeBlock() need to be re-engineered to handle concurrency with other backends". The more I've thought about it, the more appropriate the kind of specialization I've come up with seems. I've concluded:

- Sorting is important, and therefore worth adding non-general infrastructure in support of. It's important enough to have its own logtape.c module, so why not this? Much of buffile.c was explicitly written with sorting and hashing in mind from the beginning. We use BufFiles for other things, but those two things are by far the two most important users of temp files, and the only really compelling candidates for parallelization.

- There are limited opportunities to share BufFile infrastructure for parallel sorting and parallel hashing. Hashing is inverse to sorting conceptually, so it should not be surprising that this is the case. By that I mean that hashing is characterized by logical division and physical combination, whereas sorting is characterized by physical division and logical combination. Parallel tuplesort naturally allows each worker to do an enormous amount of work with whatever data it is fed by the parallel heap scan that it joins, *long* before the data needs to be combined with data from other workers in any way.
Consider this code from Thomas' parallel hash join patch:

> +bool
> +ExecHashCheckForEarlyExit(HashJoinTable hashtable)
> +{
> +	/*
> +	 * The golden rule of leader deadlock avoidance: since leader processes
> +	 * have two separate roles, namely reading from worker queues AND executing
> +	 * the same plan as workers, we must never allow a leader to wait for
> +	 * workers if there is any possibility those workers have emitted tuples.
> +	 * Otherwise we could get into a situation where a worker fills up its
> +	 * output tuple queue and begins waiting for the leader to read, while
> +	 * the leader is busy waiting for the worker.
> +	 *
> +	 * Parallel hash joins with shared tables are inherently susceptible to
> +	 * such deadlocks because there are points at which all participants must
> +	 * wait (you can't start checking for unmatched tuples in the hash table
> +	 * until probing has completed in all workers, etc).

Parallel sort will never have to do anything like this. There is minimal IPC before the leader's merge, and the dependencies between phases are extremely simple (there is only one; workers need to finish before leader can merge, and must stick around in a quiescent state throughout). Data throughput is what tuplesort cares about; it doesn't really care about latency. Whereas, I gather that there needs to be continual gossip between hash join workers (those building a hash table) about the number of batches. They don't have to be in perfect lockstep, but they need to cooperate closely; the IPC is pretty eager, and therefore latency sensitive. Thomas makes use of atomic ops in his patch, which makes sense, but I'd never bother with anything like that for parallel tuplesort; there'd be no measurable benefit there.

In general, it's not obvious to me that the SharedTemporaryFile() API that Robert sketched recently (or any very general shared file interface that does things like buffer writes in shared memory, uses a shared read pointer, etc) is right for either parallel hash join or parallel sort. I don't see that there is much to be said for a reference count mechanism for parallel sort BufFiles, since the dependencies are so simple and fixed, and for hash join, a much tighter mechanism seems desirable. I can't think why Thomas would want a shared read pointer, since the way he builds the shared hash table leaves it immutable once probing is underway; ISTM that he'll want that kind of mechanism to operate at a higher level, in a more specialized way. That said, I don't actually know what Thomas has in mind for multi-batch parallel hash joins, since that's only a TODO item in the most recent revision of his patch (maybe I missed something he wrote on this topic, though). Thomas is working on a revision that resolves that open item, at which point we'll know more. I understand that a new revision of his patch that closes out the TODO item isn't too far from being posted.

--
Peter Geoghegan
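As an aside, the "one central chokepoint" for worker offsets that the V7 notes above describe can be illustrated in a couple of lines of C; the type and field names here are simplified stand-ins, not the patch's actual definitions:

/*
 * All worker-offset arithmetic happens in the leader's read path; the
 * write path never sees an offset, which is why ltsWriteBlock() callers
 * need no knowledge of unification.
 */
typedef struct LogicalTapeSketch
{
	long	curBlockNumber;		/* tape-local block number */
	long	offsetBlockNumber;	/* where this worker's blocks begin in the
								 * unified BufFile space; 0 for ordinary
								 * serial tapes */
} LogicalTapeSketch;

/* Translate a tape-local block number for the next read. */
static long
ltsTranslateBlockSketch(const LogicalTapeSketch *lt)
{
	return lt->curBlockNumber + lt->offsetBlockNumber;
}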
On Wed, Jan 4, 2017 at 12:53 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Attached is V7 of the patch.

I am doing some testing. First, some superficial things from first pass:

Still applies with some offsets and one easy-to-fix rejected hunk in nbtree.c (removing some #include directives and a struct definition).

+/* Sort parallel code from state for sort__start probes */
+#define PARALLEL_SORT(state)	((state)->shared == NULL ? 0 : \
+								 (state)->workerNum >= 0 : 1 : 2)

Typo: ':' instead of '?'; the --enable-dtrace build fails.

+	 the entire utlity command, regardless of the number of

Typo: s/utlity/utility/

+	/* Perform sorting of spool, and possibly a spool2 */
+	sortmem = Max(maintenance_work_mem / btshared->scantuplesortstates, 64);

Just an observation: if you ask for a large number of workers, but only one can be launched, it will be constrained to a small fraction of maintenance_work_mem, but use only one worker. That's probably OK, and I don't see how to do anything about it unless you are prepared to make workers wait for an initial message from the leader to inform them how many were launched. Should this 64KB minimum be mentioned in the documentation?

+	if (!btspool->isunique)
+	{
+		shm_toc_estimate_keys(&pcxt->estimator, 2);
+	}

Project style: people always tell me to drop the curlies in cases like that. There are a few more examples in the patch.

+		/* Wait for workers */
+		ConditionVariableSleep(&shared->workersFinishedCv,
+							   WAIT_EVENT_PARALLEL_FINISH);

I don't think we should reuse WAIT_EVENT_PARALLEL_FINISH in tuplesort_leader_wait and worker_wait. That belongs to WaitForParallelWorkersToFinish, so someone who sees that in pg_stat_activity won't know which it is.

IIUC worker_wait() is only being used to keep the worker around so its files aren't deleted. Once buffile cleanup is changed to be ref-counted (in an on_dsm_detach hook?) then workers might as well exit sooner, freeing up a worker slot... do I have that right?

Incidentally, barrier.c could probably be used for this synchronisation instead of these functions. I think _bt_begin_parallel would call BarrierInit(&shared->barrier, scantuplesortstates) and then after LaunchParallelWorkers() it'd call a new interface BarrierDetachN(&shared->barrier, scantuplesortstates - pcxt->nworkers_launched) to forget about workers that failed to launch. Then you could use BarrierWait where the leader waits for the workers to finish, and BarrierDetach where the workers are finished and want to exit.

+	/* Prepare state to create unified tapeset */
+	leaderTapes = palloc(sizeof(TapeShare) * state->maxTapes);

Missing cast (TapeShare *) here? Project style, judging by code I've seen, and it avoids gratuitous C++ incompatibility.

+_bt_parallel_shared_estimate(Snapshot snapshot)
...
+tuplesort_estimate_shared(int nWorkers)

Inconsistent naming?

More soon.

--
Thomas Munro
http://www.enterprisedb.com
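(Presumably the intended macro, with the stray ':' corrected to '?', reads as below -- 0 for a serial sort, 1 for a worker, 2 for the leader, if the comment above it is anything to go by:)

#define PARALLEL_SORT(state)	((state)->shared == NULL ? 0 : \
								 (state)->workerNum >= 0 ? 1 : 2)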
On Mon, Jan 30, 2017 at 8:46 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> On Wed, Jan 4, 2017 at 12:53 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> Attached is V7 of the patch.
>
> I am doing some testing. First, some superficial things from first pass:
>
> [Various minor cosmetic issues]

Oops.

> Just an observation: if you ask for a large number of workers, but
> only one can be launched, it will be constrained to a small fraction
> of maintenance_work_mem, but use only one worker. That's probably OK,
> and I don't see how to do anything about it unless you are prepared to
> make workers wait for an initial message from the leader to inform
> them how many were launched.

Actually, the leader-owned worker Tuplesort state will have the appropriate amount, so you'd still need to have 2 participants (1 worker + leader-as-worker). And, sorting is much less sensitive to having a bit less memory than hashing (at least when there aren't dozens of runs to merge in the end, or multiple passes). So, I agree that this isn't worth worrying about for a DDL statement.

> Should this 64KB minimum be mentioned in the documentation?

You mean user-visible documentation, and not just tuplesort.h? I don't think that that's necessary. That's a ludicrously low amount of memory for a worker to be limited to anyway. It will never come up with remotely sensible use of the feature.

> +	if (!btspool->isunique)
> +	{
> +		shm_toc_estimate_keys(&pcxt->estimator, 2);
> +	}
>
> Project style: people always tell me to drop the curlies in cases like
> that. There are a few more examples in the patch.

I only do this when there is an "else" that must have curly braces, too. There are plenty of examples of this from existing code, so I think it's fine.

> +		/* Wait for workers */
> +		ConditionVariableSleep(&shared->workersFinishedCv,
> +							   WAIT_EVENT_PARALLEL_FINISH);
>
> I don't think we should reuse WAIT_EVENT_PARALLEL_FINISH in
> tuplesort_leader_wait and worker_wait. That belongs to
> WaitForParallelWorkersToFinish, so someone who sees that in
> pg_stat_activity won't know which it is.

Noted.

> IIUC worker_wait() is only being used to keep the worker around so its
> files aren't deleted. Once buffile cleanup is changed to be
> ref-counted (in an on_dsm_detach hook?) then workers might as well
> exit sooner, freeing up a worker slot... do I have that right?

Yes. Or at least I think it's very likely that that will end up happening.

> Incidentally, barrier.c could probably be used for this
> synchronisation instead of these functions. I think
> _bt_begin_parallel would call BarrierInit(&shared->barrier,
> scantuplesortstates) and then after LaunchParallelWorkers() it'd call
> a new interface BarrierDetachN(&shared->barrier, scantuplesortstates -
> pcxt->nworkers_launched) to forget about workers that failed to
> launch. Then you could use BarrierWait where the leader waits for the
> workers to finish, and BarrierDetach where the workers are finished
> and want to exit.

I thought about doing that, actually, but I don't like creating dependencies on some other uncommitted patch, which is a moving target (barrier stuff isn't committed yet). It makes life difficult for reviewers. I put off adopting condition variables until they were committed for the same reason -- it was easy to do without them for a time. I'll probably get around to it before too long, but feel no urgency about it. Barriers will only allow me to make a modest net removal of code, AFAIK.

Thanks

--
Peter Geoghegan
On Tue, Jan 31, 2017 at 12:15 AM, Peter Geoghegan <pg@bowt.ie> wrote: >> Should this 64KB minimum be mentioned in the documentation? > > You mean user-visible documentation, and not just tuplesort.h? I don't > think that that's necessary. That's a ludicrously low amount of memory > for a worker to be limited to anyway. It will never come up with > remotely sensible use of the feature. I agree. >> + if (!btspool->isunique) >> + { >> + shm_toc_estimate_keys(&pcxt->estimator, 2); >> + } >> >> Project style: people always tell me to drop the curlies in cases like >> that. There are a few more examples in the patch. > > I only do this when there is an "else" that must have curly braces, > too. There are plenty of examples of this from existing code, so I > think it's fine. But I disagree on this one. I think if (blah) stuff(); else { thing(); gargle(); } ...is much better than if (blah) { stuff(); } else { thing(); gargle(); } But if there were a comment on a separate line before the call to stuff(), then I would do it the second way. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Jan 31, 2017 at 2:15 PM, Peter Geoghegan <pg@bowt.ie> wrote: > On Mon, Jan 30, 2017 at 8:46 PM, Thomas Munro > <thomas.munro@enterprisedb.com> wrote: >> On Wed, Jan 4, 2017 at 12:53 PM, Peter Geoghegan <pg@heroku.com> wrote: >>> Attached is V7 of the patch. >> >> I am doing some testing. First, some superficial things from first pass: >> >> [Various minor cosmetic issues] > > Oops. As this review is very recent, I have moved the patch to CF 2017-03. -- Michael
On Wed, Feb 1, 2017 at 5:37 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Tue, Jan 31, 2017 at 2:15 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>> On Mon, Jan 30, 2017 at 8:46 PM, Thomas Munro
>> <thomas.munro@enterprisedb.com> wrote:
>>> On Wed, Jan 4, 2017 at 12:53 PM, Peter Geoghegan <pg@heroku.com> wrote:
>>>> Attached is V7 of the patch.
>>>
>>> I am doing some testing. First, some superficial things from first pass:
>>>
>>> [Various minor cosmetic issues]
>>
>> Oops.
>
> As this review is very recent, I have moved the patch to CF 2017-03.

 ParallelContext *
-CreateParallelContext(parallel_worker_main_type entrypoint, int nworkers)
+CreateParallelContext(parallel_worker_main_type entrypoint, int nworkers,
+					  bool serializable_okay)
 {
 	MemoryContext oldcontext;
 	ParallelContext *pcxt;
@@ -143,7 +144,7 @@ CreateParallelContext(parallel_worker_main_type entrypoint, int nworkers)
 	 * workers, at least not until somebody enhances that mechanism to be
 	 * parallel-aware.
 	 */
-	if (IsolationIsSerializable())
+	if (IsolationIsSerializable() && !serializable_okay)
 		nworkers = 0;

That's a bit weird but I can't think of a problem with it. Workers run with MySerializableXact == InvalidSerializableXact, even though they may have the snapshot of a SERIALIZABLE leader. Hopefully soon the restriction on SERIALIZABLE in parallel queries can be lifted anyway, and then this could be removed.

Here are some thoughts on the overall approach. Disclaimer: I haven't researched the state of the art in parallel sort or btree builds. But I gather from general reading that there are a couple of well known approaches, and I'm sure you'll correct me if I'm off base here.

1. All participants: parallel sequential scan, repartition on the fly so each worker has tuples in a non-overlapping range, sort, build disjoint btrees; barrier; leader: merge disjoint btrees into one.

2. All participants: parallel sequential scan, sort, spool to disk; barrier; leader: merge spooled tuples and build btree.

This patch is doing the 2nd thing. My understanding is that some systems might choose to do that if they don't have or don't like the table's statistics, since repartitioning for balanced load requires carefully chosen ranges and is highly sensitive to distribution problems.

It's pretty clear that approach 1 is a difficult project. From my research into dynamic repartitioning in the context of hash joins, I can see that that infrastructure is a significant project in its own right: subproblems include super efficient tuple exchange, buffering, statistics/planning and dealing with/adapting to bad outcomes. I also suspect that repartitioning operators might need to be specialised for different purposes like sorting vs hash joins, which may have differing goals. I think it's probably easy to build a slow dynamic repartitioning mechanism that frequently results in terrible worst case scenarios where you paid a fortune in IPC overheads and still finished up with one worker pulling most of the whole load. Without range partitioning, I don't believe you can merge the resulting non-disjoint btrees efficiently so you'd probably finish up writing a complete new btree to mash them together. As for merging disjoint btrees, I assume there are ways to do a structure-preserving merge that just rebuilds some internal pages and incorporates the existing leaf pages directly, a bit like tree manipulation in functional programming languages; that'll take some doing.
So I'm in favour of this patch, which is relatively simple and gives us faster index builds soon. Eventually we might also be able to have approach 1. From what I gather, it's entirely possible that we might still need 2 to fall back on in some cases.

Will you move the BufFile changes to a separate patch in the next revision?

Still testing and reviewing, more soon.

--
Thomas Munro
http://www.enterprisedb.com
On Tue, Jan 31, 2017 at 11:23 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> 2. All participants: parallel sequential scan, sort, spool to disk;
> barrier; leader: merge spooled tuples and build btree.
>
> This patch is doing the 2nd thing. My understanding is that some
> systems might choose to do that if they don't have or don't like the
> table's statistics, since repartitioning for balanced load requires
> carefully chosen ranges and is highly sensitive to distribution
> problems.

The second thing here seems to offer comparable scalability to other systems' implementations of the first thing. They seem to have reused "partitioning to sort in parallel" for B-Tree builds, at least in some cases, despite this. WAL logging is the biggest serial bottleneck here for other systems, I've heard -- that's still going to be pretty much serial.

I think that the fact that some systems do partitioning for parallel B-Tree builds might have as much to do with their ability to create B-Tree indexes in place as anything else. Apparently, some systems don't use temp files, instead writing out what is for all intents and purposes part of a finished B-Tree as runs (no use of temp_tablespaces). That may be a big part of what makes it worthwhile to try to use partitioning. I understand that only the highest client counts will see much direct performance benefit relative to the first approach.

> It's pretty clear that approach 1 is a difficult project. From my
> research into dynamic repartitioning in the context of hash joins, I
> can see that that infrastructure is a significant project in its own
> right: subproblems include super efficient tuple exchange, buffering,
> statistics/planning and dealing with/adapting to bad outcomes. I also
> suspect that repartitioning operators might need to be specialised for
> different purposes like sorting vs hash joins, which may have
> differing goals. I think it's probably easy to build a slow dynamic
> repartitioning mechanism that frequently results in terrible worst
> case scenarios where you paid a fortune in IPC overheads and still
> finished up with one worker pulling most of the whole load. Without
> range partitioning, I don't believe you can merge the resulting
> non-disjoint btrees efficiently so you'd probably finish up writing a
> complete new btree to mash them together. As for merging disjoint
> btrees, I assume there are ways to do a structure-preserving merge
> that just rebuilds some internal pages and incorporates the existing
> leaf pages directly, a bit like tree manipulation in functional
> programming languages; that'll take some doing.

I agree with all that. "Stitching together" disjoint B-Trees does seem to have some particular risks, which users of other systems are cautioned against in their documentation. You can end up with an unbalanced B-Tree.

> So I'm in favour of this patch, which is relatively simple and gives us
> faster index builds soon. Eventually we might also be able to have
> approach 1. From what I gather, it's entirely possible that we might
> still need 2 to fall back on in some cases.

Right. And it can form the basis of an implementation of 1, which in any case seems to be much more compelling for parallel query, when a great deal more can be pushed down, and we are not particularly likely to be I/O bound (usually not much writing to the heap, or WAL logging).

> Will you move the BufFile changes to a separate patch in the next revision?

That is the plan.
I need to get set up with a new machine here, having given back my work laptop to Heroku, but it shouldn't take too long. Thanks for the review. -- Peter Geoghegan
On Wed, Feb 1, 2017 at 8:46 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Tue, Jan 31, 2017 at 11:23 PM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> So I'm in favour of this patch, which is relatively simple and gives us
>> faster index builds soon. Eventually we might also be able to have
>> approach 1. From what I gather, it's entirely possible that we might
>> still need 2 to fall back on in some cases.
>
> Right. And it can form the basis of an implementation of 1, which in
> any case seems to be much more compelling for parallel query, when a
> great deal more can be pushed down, and we are not particularly likely
> to be I/O bound (usually not much writing to the heap, or WAL
> logging).

I ran some tests today. First I created test tables representing the permutations of these choices:

Table structure:

  int = Integer key only
  intwide = Integer key + wide row
  text = Text key only (using dictionary words)
  textwide = Text key + wide row

Uniqueness:

  u = each value unique
  d = 10 duplicates of each value

Heap physical order:

  rand = Random
  asc = Ascending order (already sorted)
  desc = Descending order (sorted backwards)

I used 10 million rows for this test run, so that gave me 24 tables of the following sizes as reported in "\d+":

  int tables = 346MB each
  intwide tables = 1817MB each
  text tables = 441MB each
  textwide tables = 1953MB each

It'd be interesting to test larger tables of course but I had a lot of permutations to get through. For each of those tables I ran tests corresponding to the permutations of these three variables:

Index type:

  uniq = CREATE UNIQUE INDEX ("u" tables only, ie no duplicates)
  nonu = CREATE INDEX ("u" and "d" tables)

Maintenance memory: 1MB, 64MB, 256MB, 512MB

Workers: from 0 up to 8

Environment: EDB test machine "cthulhu", Intel(R) Xeon(R) CPU E7-8830 @ 2.13GHz, 8 socket, 8 cores (16 threads) per socket, CentOS 7.2, Linux kernel 3.10.0-229.7.2.el7.x86_64, 512GB RAM, pgdata on SSD. Database initialised with en_US.utf-8 collation, all defaults except max_wal_size increased to 4GB (otherwise warnings about too frequent checkpoints) and max_parallel_workers_maintenance = 8. Testing done with warm OS cache.

I applied your v2 patch on top of 7ac4a389a7dbddaa8b19deb228f0a988e79c5795^ to avoid a conflict. It still had a couple of harmless conflicts that I was able to deal with (not code, just some header stuff moving around).

See full results from all permutations attached, but I wanted to highlight the measurements from 'textwide', 'u', 'nonu' which show interesting 'asc' numbers (data already sorted). The 'mem' column is maintenance_work_mem in megabytes. The 'w = 0' column shows the time in seconds for parallel_workers = 0. The other 'w = N' columns show times with higher parallel_workers settings, represented as speed-up relative to the 'w = 0' time.

1. 'asc' = pre-sorted data (w = 0 shows time in seconds, other columns show speed-up relative to that time):

 mem | w = 0  | w = 1 | w = 2 | w = 3 | w = 4 | w = 5 | w = 6 | w = 7 | w = 8
-----+--------+-------+-------+-------+-------+-------+-------+-------+-------
   1 | 119.97 | 4.61x | 4.83x | 5.32x | 5.61x | 5.88x | 6.10x | 6.18x | 6.09x
  64 |  19.42 | 1.18x | 1.10x | 1.23x | 1.23x | 1.16x | 1.19x | 1.20x | 1.21x
 256 |  18.35 | 1.02x | 0.92x | 0.98x | 1.02x | 1.06x | 1.07x | 1.08x | 1.10x
 512 |  17.75 | 1.01x | 0.89x | 0.95x | 0.99x | 1.02x | 1.05x | 1.06x | 1.07x

2. 'rand' = randomised data:

 mem | w = 0  | w = 1 | w = 2 | w = 3 | w = 4 | w = 5 | w = 6 | w = 7 | w = 8
-----+--------+-------+-------+-------+-------+-------+-------+-------+-------
   1 | 130.25 | 1.82x | 2.19x | 2.52x | 2.58x | 2.72x | 2.72x | 2.83x | 2.89x
  64 | 117.36 | 1.80x | 2.20x | 2.43x | 2.47x | 2.55x | 2.51x | 2.59x | 2.69x
 256 | 124.68 | 1.87x | 2.20x | 2.49x | 2.52x | 2.64x | 2.70x | 2.72x | 2.75x
 512 | 115.77 | 1.51x | 1.72x | 2.14x | 2.08x | 2.19x | 2.31x | 2.44x | 2.48x

3. 'desc' = reverse-sorted data:

 mem | w = 0  | w = 1 | w = 2 | w = 3 | w = 4 | w = 5 | w = 6 | w = 7 | w = 8
-----+--------+-------+-------+-------+-------+-------+-------+-------+-------
   1 | 115.19 | 1.88x | 2.39x | 2.78x | 3.50x | 3.62x | 4.20x | 4.19x | 4.39x
  64 | 112.17 | 1.85x | 2.25x | 2.99x | 3.63x | 3.65x | 4.01x | 4.31x | 4.62x
 256 | 119.55 | 1.76x | 2.21x | 2.85x | 3.43x | 3.37x | 3.77x | 4.24x | 4.28x
 512 | 119.50 | 1.85x | 2.19x | 2.87x | 3.26x | 3.28x | 3.74x | 4.24x | 3.93x

The 'asc' effects are much less pronounced when the key is an int. Here is the equivalent data for 'intwide', 'u', 'nonu':

1. 'asc'

 mem | w = 0 | w = 1 | w = 2 | w = 3 | w = 4 | w = 5 | w = 6 | w = 7 | w = 8
-----+-------+-------+-------+-------+-------+-------+-------+-------+-------
   1 | 12.19 | 1.55x | 1.93x | 2.21x | 2.44x | 2.64x | 2.76x | 2.91x | 2.83x
  64 |  7.35 | 1.29x | 1.53x | 1.69x | 1.86x | 1.98x | 2.04x | 2.07x | 2.09x
 256 |  7.34 | 1.26x | 1.47x | 1.64x | 1.79x | 1.92x | 1.96x | 1.98x | 2.02x
 512 |  7.24 | 1.24x | 1.46x | 1.65x | 1.80x | 1.91x | 1.97x | 2.00x | 1.92x

2. 'rand'

 mem | w = 0 | w = 1 | w = 2 | w = 3 | w = 4 | w = 5 | w = 6 | w = 7 | w = 8
-----+-------+-------+-------+-------+-------+-------+-------+-------+-------
   1 | 15.16 | 1.56x | 2.01x | 2.32x | 2.57x | 2.73x | 2.87x | 2.95x | 2.91x
  64 | 12.97 | 1.55x | 1.97x | 2.25x | 2.44x | 2.58x | 2.70x | 2.74x | 2.71x
 256 | 13.14 | 1.47x | 1.86x | 2.12x | 2.31x | 2.50x | 2.62x | 2.58x | 2.69x
 512 | 13.61 | 1.48x | 1.91x | 2.22x | 2.37x | 2.55x | 2.65x | 2.73x | 2.73x

3. 'desc'

 mem | w = 0 | w = 1 | w = 2 | w = 3 | w = 4 | w = 5 | w = 6 | w = 7 | w = 8
-----+-------+-------+-------+-------+-------+-------+-------+-------+-------
   1 | 13.45 | 1.51x | 1.94x | 2.31x | 2.56x | 2.75x | 2.95x | 3.05x | 3.00x
  64 | 10.27 | 1.42x | 1.82x | 2.05x | 2.30x | 2.46x | 2.59x | 2.64x | 2.65x
 256 | 10.52 | 1.39x | 1.70x | 2.02x | 2.24x | 2.34x | 2.39x | 2.48x | 2.56x
 512 | 10.62 | 1.43x | 1.82x | 2.06x | 2.32x | 2.51x | 2.61x | 2.68x | 2.69x

Full result summary and scripts used for testing attached.

--
Thomas Munro
http://www.enterprisedb.com
On Fri, Feb 3, 2017 at 5:04 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> I applied your v2 patch on top of
> 7ac4a389a7dbddaa8b19deb228f0a988e79c5795^ to avoid a conflict. It
> still had a couple of harmless conflicts that I was able to deal with
> (not code, just some header stuff moving around).

You must mean my V7 patch. FWIW, I've resolved the conflicts with 7ac4a389a7dbddaa8b19deb228f0a988e79c5795 in my own private branch, and have worked through some of the open items that you raised.

> See full results from all permutations attached, but I wanted to
> highlight the measurements from 'textwide', 'u', 'nonu' which show
> interesting 'asc' numbers (data already sorted). The 'mem' column is
> maintenance_work_mem in megabytes. The 'w = 0' column shows the time
> in seconds for parallel_workers = 0. The other 'w = N' columns show
> times with higher parallel_workers settings, represented as speed-up
> relative to the 'w = 0' time.

The thing to keep in mind about testing presorted cases in tuplesort in general is that we have this weird precheck for presorted input in our qsort. This is something added by us to the original Bentley & McIlroy algorithm in 2006. I am very skeptical of this addition, in general. It tends to have the effect of highly distorting how effective most optimizations are for presorted cases, which comes up again and again. It only helps when the input is *perfectly* presorted, and a single out-of-order tuple at the end throws away all the work done up to that point (which wouldn't be so bad if the main cost were comparisons rather than memory accesses, but that isn't the case).

Your baseline case can either be made unrealistically fast due to the fact that you get a perfectly sympathetic case for this optimization, or unrealistically slow (very CPU bound) due to the fact that you have that one last tuple out of place. I once said that this last tuple can act like a discarded banana skin.

There is nothing wrong with the idea of exploiting presortedness, and to some extent the original algorithm does that (by using insertion sort), but an optimization along the lines of Timsort's "galloping mode" (which is what this modification of ours attempts) requires non-trivial bookkeeping to do right.

--
Peter Geoghegan
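For readers who haven't seen it, the precheck in question looks roughly like this (simplified into a standalone function for illustration; in the real qsort it is inline in the sort routine):

#include <stdbool.h>
#include <stddef.h>

/*
 * One comparison per element when the input is perfectly sorted, in
 * which case the caller can skip sorting entirely.  But a single
 * out-of-order element -- the "banana skin" -- means every comparison
 * made up to that point was wasted, and the full sort starts from
 * scratch.
 */
static bool
input_is_presorted(char *a, size_t n, size_t es,
				   int (*cmp) (const void *, const void *))
{
	char	   *pm;

	for (pm = a + es; pm < a + n * es; pm += es)
	{
		if (cmp(pm - es, pm) > 0)
			return false;
	}
	return true;
}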
On Fri, Feb 3, 2017 at 5:04 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> 1. 'asc' = pre-sorted data (w = 0 shows time in seconds, other columns
> show speed-up relative to that time):
>
>  mem | w = 0  | w = 1 | w = 2 | w = 3 | w = 4 | w = 5 | w = 6 | w = 7 | w = 8
> -----+--------+-------+-------+-------+-------+-------+-------+-------+-------
>    1 | 119.97 | 4.61x | 4.83x | 5.32x | 5.61x | 5.88x | 6.10x | 6.18x | 6.09x
>   64 |  19.42 | 1.18x | 1.10x | 1.23x | 1.23x | 1.16x | 1.19x | 1.20x | 1.21x
>  256 |  18.35 | 1.02x | 0.92x | 0.98x | 1.02x | 1.06x | 1.07x | 1.08x | 1.10x
>  512 |  17.75 | 1.01x | 0.89x | 0.95x | 0.99x | 1.02x | 1.05x | 1.06x | 1.07x

I think that this presorted case doesn't improve much because the sorting itself is so cheap, as explained in my last mail. However, the improvements as workers are added are still smaller than expected. I think that this indicates that there isn't enough I/O capacity available here to truly show the full potential of the patch -- I've certainly seen better scalability for cases like this when there is a lot of I/O bandwidth available, and I/O parallelism is there to be taken advantage of. Say, when using a system with a large RAID array (I used a RAID0 array with 12 HDDs for my own tests). Another issue is that you probably don't have enough data here to really show off the patch. I don't want to dismiss the benchmark, which is still quite informative, but it's worth pointing out that the feature is going to be most compelling for very large indexes, that will take at least several minutes to build under any circumstances. (Having a reproducible case is also important, and that is something your tests have going for them, on the other hand.)

I suspect that this system isn't particularly well balanced for the task of benchmarking the patch. You would probably see notably better scalability than any you've shown in any test if you could add additional sequential I/O bandwidth, which is probably an economical, practical choice for many users. I suspect that you aren't actually saturating available CPUs to the greatest extent that the implementation makes possible.

Another thing I want to point out is that with 1MB of maintenance_work_mem, the patch appears to do very well, but that isn't terribly meaningful. I would suggest that we avoid testing this patch with such a low amount of memory -- it doesn't seem important. This is skewed by the fact that you're using replacement selection in the serial case only. I think what this actually demonstrates is that replacement selection is very slow, even with its putative best case. I believe that commit 2459833 was the final nail in the coffin of replacement selection. I certainly don't want to relitigate the discussion on replacement_sort_tuples, and am not going to push too hard, but ISTM that we should fully remove replacement selection from tuplesort.c and be done with it.

--
Peter Geoghegan
On Sat, Feb 4, 2017 at 11:58 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Fri, Feb 3, 2017 at 5:04 AM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> 1. 'asc' = pre-sorted data (w = 0 shows time in seconds, other columns
>> show speed-up relative to that time):
>>
>>  mem | w = 0  | w = 1 | w = 2 | w = 3 | w = 4 | w = 5 | w = 6 | w = 7 | w = 8
>> -----+--------+-------+-------+-------+-------+-------+-------+-------+-------
>>    1 | 119.97 | 4.61x | 4.83x | 5.32x | 5.61x | 5.88x | 6.10x | 6.18x | 6.09x
>>   64 |  19.42 | 1.18x | 1.10x | 1.23x | 1.23x | 1.16x | 1.19x | 1.20x | 1.21x
>>  256 |  18.35 | 1.02x | 0.92x | 0.98x | 1.02x | 1.06x | 1.07x | 1.08x | 1.10x
>>  512 |  17.75 | 1.01x | 0.89x | 0.95x | 0.99x | 1.02x | 1.05x | 1.06x | 1.07x
>
> I think that this presorted case doesn't improve much because the
> sorting itself is so cheap, as explained in my last mail. However, the
> improvements as workers are added are still smaller than expected. I
> think that this indicates that there isn't enough I/O capacity
> available here to truly show the full potential of the patch -- I've
> certainly seen better scalability for cases like this when there is a
> lot of I/O bandwidth available, and I/O parallelism is there to be
> taken advantage of. Say, when using a system with a large RAID array
> (I used a RAID0 array with 12 HDDs for my own tests). Another issue is
> that you probably don't have enough data here to really show off the
> patch. I don't want to dismiss the benchmark, which is still quite
> informative, but it's worth pointing out that the feature is going to
> be most compelling for very large indexes, that will take at least
> several minutes to build under any circumstances. (Having a
> reproducible case is also important, and that is something your tests
> have going for them, on the other hand.)

Right. My main reason for starting smallish was to allow me to search a space with several variables without waiting eons. Next I would like to run a small subset of those tests with, say, 10, 20 or even 100 times more data loaded, so the tables would be ~20GB, ~40GB or ~200GB.

About read bandwidth: It shouldn't have been touching the disk at all for reads: I did a dummy run of the index build before the measured runs, so that a 2GB table being sorted in ~2 minutes would certainly have come entirely from the OS page cache since the machine has oodles of RAM.

About write bandwidth: The WAL, the index and the temp files all went to an SSD array, though I don't have the characteristics of that to hand. I should also be able to test on a multi-spindle HDD array. I doubt either can touch your 12-way RAID0 array, but will look into that.

> I suspect that this system isn't particularly well balanced for the
> task of benchmarking the patch. You would probably see notably better
> scalability than any you've shown in any test if you could add
> additional sequential I/O bandwidth, which is probably an economical,
> practical choice for many users. I suspect that you aren't actually
> saturating available CPUs to the greatest extent that the
> implementation makes possible.

I will look into what IO options I can access before running larger tests. Also I will look into running the test with both cold and warm caches (ie "echo 1 > /proc/sys/vm/drop_caches") so that read bandwidth enters the picture.

> Another thing I want to point out is that with 1MB of
> maintenance_work_mem, the patch appears to do very well, but that
> isn't terribly meaningful.
> I would suggest that we avoid testing this
> patch with such a low amount of memory -- it doesn't seem important.
> This is skewed by the fact that you're using replacement selection in
> the serial case only. I think what this actually demonstrates is that
> replacement selection is very slow, even with its putative best case.
> I believe that commit 2459833 was the final nail in the coffin of
> replacement selection. I certainly don't want to relitigate the
> discussion on replacement_sort_tuples, and am not going to push too
> hard, but ISTM that we should fully remove replacement selection from
> tuplesort.c and be done with it.

Interesting. I haven't grokked this but will go and read about it.

Based on your earlier comments about banana skin effects, I'm wondering if it would be interesting to add a couple more heap distributions to the test set that are almost completely sorted except for a few entries out of order.

--
Thomas Munro
http://www.enterprisedb.com
On Fri, Feb 3, 2017 at 4:15 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>> I suspect that this system isn't particularly well balanced for the
>> task of benchmarking the patch. You would probably see notably better
>> scalability than any you've shown in any test if you could add
>> additional sequential I/O bandwidth, which is probably an economical,
>> practical choice for many users. I suspect that you aren't actually
>> saturating available CPUs to the greatest extent that the
>> implementation makes possible.
>
> I will look into what IO options I can access before running larger
> tests. Also I will look into running the test with both cold and warm
> caches (ie "echo 1 > /proc/sys/vm/drop_caches") so that read bandwidth
> enters the picture.

It might just have been that the table was too small to be an effective target for parallel sequential scan with so many workers, and so a presorted best case CREATE INDEX, which isn't that different, also fails to see much benefit (compared to what you'd see with a similar case involving a larger table). In other words, I might have jumped the gun in emphasizing issues with hardware and I/O bandwidth over issues around data volume (that I/O parallelism is inherently not very helpful with these relatively small tables).

As I've pointed out a couple of times before, bigger sorts will be more CPU bound because sorting itself has costs that grow linearithmically, whereas writing out runs has costs that grow linearly. The relative cost of the I/O can be expected to go down as input goes up for this reason. At the same time, a larger input might make better use of I/O parallelism, which reduces the cost paid in latency to write out runs in absolute terms.

--
Peter Geoghegan
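To put rough numbers on that scaling argument (illustrative arithmetic only): a comparison sort does on the order of n * log2(n) comparisons, so growing the input from 10 million tuples (log2(n) ~= 23) to 1 billion tuples (log2(n) ~= 30) multiplies the comparison work by about 100 * (30 / 23), or roughly 130x, while the volume of run data written out grows by only 100x. The bigger the sort, the smaller the share of the total cost that writing runs can represent.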
On Mon, Jan 30, 2017 at 9:15 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>> IIUC worker_wait() is only being used to keep the worker around so its
>> files aren't deleted. Once buffile cleanup is changed to be
>> ref-counted (in an on_dsm_detach hook?) then workers might as well
>> exit sooner, freeing up a worker slot... do I have that right?
>
> Yes. Or at least I think it's very likely that that will end up happening.

I've looked into this, and have a version of the patch where clean-up occurs when the last backend with a reference to the BufFile goes away. It seems robust; all of my private tests pass, including tests of things that parallel CREATE INDEX won't use but that are added as infrastructure anyway (e.g., randomAccess recycling of blocks by the leader from workers). As Thomas anticipated, worker_wait() now only makes workers wait until the leader comes along to take a reference to their files, at which point the worker processes can go away. In effect, the worker processes go away as soon as possible, just as the leader begins its final on-the-fly merge. At that point, they could be reused by some other process, of course.

However, there are some specific implementation issues with this that I didn't quite anticipate. I would like to get feedback on these issues now, from both Thomas and Robert. The issues relate to how much the patch can or should "buy into resource management". You might guess that this new resource management code is something that should live in fd.c, alongside the guts of temp file resource management, within the function FileClose(). That way, it would be called by every possible path that might delete a temp file, including ResourceOwnerReleaseInternal(). That's not what I've done, though. Instead, refcount management is limited to a few higher level routines in buffile.c. Initially, resource management in FileClose() is made to assume that it must delete the file. Then, if and when directed to by BufFileClose()/refcount, a backend may determine that it is not its job to do the deletion -- it will not be the one that must "turn out the lights", and so indicates to FileClose() that it should not delete the file after all (it should just release vFDs, close(), and so on). Otherwise, when refcount reaches zero, temp files are deleted by FileClose() in more or less the conventional manner.

The fact that there could, in general, be any error that causes us to attempt a double-deletion (deletion of a temp file from more than one backend) for a time is less of a problem than you might think. This is because there is a risk of this only for as long as two backends hold open the file at the same time. In the case of parallel CREATE INDEX, this is now the shortest possible period of time, since workers close their files using BufFileClose() immediately after the leader wakes them up from a quiescent state. And, if that were to actually happen, say due to some random OOM error during that small window, the consequence is no worse than an annoying log message: "could not unlink file..." (this would come from the second backend that attempted an unlink()). You would not see this when a worker raised an error due to a duplicate violation, or any other routine problem, so it should really be almost impossible.

That having been said, this probably *is* a problematic restriction in cases where a temp file's ownership is not immediately handed over without concurrent sharing.
What happens to be a small window for the parallel CREATE INDEX patch probably wouldn't be a small window for parallel hash join. :-( It's not hard to see why I would like to do things this way. Just look at ResourceOwnerReleaseInternal(). Any release of a file happens during RESOURCE_RELEASE_AFTER_LOCKS, whereas the release of dynamic shared memory segments happens earlier, during RESOURCE_RELEASE_BEFORE_LOCKS. ISTM that the only sensible way to implement a refcount is using dynamic shared memory, and that seems hard. There are additional reasons why I suggest we go this way, such as the fact that all the relevant state belongs to BufFile, which is implemented a layer above all of the guts of resource management of temp files within fd.c. I'd have to replicate almost all state in fd.c to make it all work, which seems like a big modularity violation. Does anyone have any suggestions on how to tackle this? -- Peter Geoghegan
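To make the hand-off described above concrete, here is a minimal sketch of its shape. All names are invented for illustration -- in particular BufFileSetDeleteOnClose() is a hypothetical entry point, not anything in the patch or in today's buffile.c/fd.c:

/*
 * Minimal sketch of the refcount hand-off described above; invented
 * names, not actual patch code.  The state struct lives in the DSM
 * segment, so every co-owning backend sees the same refcount.
 */
typedef struct SharedBufFileState
{
    slock_t     mutex;
    int         refcount;       /* backends that still hold the file open */
} SharedBufFileState;

static void
BufFileCloseShared(BufFile *file, SharedBufFileState *state)
{
    bool        last_owner;

    SpinLockAcquire(&state->mutex);
    last_owner = (--state->refcount == 0);
    SpinLockRelease(&state->mutex);

    /*
     * If we are not the one that must "turn out the lights", tell fd.c
     * not to delete the underlying temp files after all: just release
     * vFDs and close().  Otherwise, closing deletes them as usual.
     */
    BufFileSetDeleteOnClose(file, last_owner);  /* hypothetical */
    BufFileClose(file);
}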
On Tue, Feb 7, 2017 at 5:43 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> However, there are some specific implementation issues with this that
> I didn't quite anticipate. I would like to get feedback on these
> issues now, from both Thomas and Robert. The issues relate to how much
> the patch can or should "buy into resource management". You might
> guess that this new resource management code is something that should
> live in fd.c, alongside the guts of temp file resource management,
> within the function FileClose(). That way, it would be called by every
> possible path that might delete a temp file, including
> ResourceOwnerReleaseInternal(). That's not what I've done, though.
> Instead, refcount management is limited to a few higher level routines
> in buffile.c. Initially, resource management in FileClose() is made to
> assume that it must delete the file. Then, if and when directed to by
> BufFileClose()/refcount, a backend may determine that it is not its
> job to do the deletion -- it will not be the one that must "turn out
> the lights", and so indicates to FileClose() that it should not delete
> the file after all (it should just release vFDs, close(), and so on).
> Otherwise, when refcount reaches zero, temp files are deleted by
> FileClose() in more or less the conventional manner.
>
> The fact that there could, in general, be any error that causes us to
> attempt a double-deletion (deletion of a temp file from more than one
> backend) for a time is less of a problem than you might think. This is
> because there is a risk of this only for as long as two backends hold
> open the file at the same time. In the case of parallel CREATE INDEX,
> this is now the shortest possible period of time, since workers close
> their files using BufFileClose() immediately after the leader wakes
> them up from a quiescent state. And, if that were to actually happen,
> say due to some random OOM error during that small window, the
> consequence is no worse than an annoying log message: "could not
> unlink file..." (this would come from the second backend that
> attempted an unlink()). You would not see this when a worker raised an
> error due to a duplicate violation, or any other routine problem, so
> it should really be almost impossible.
>
> That having been said, this probably *is* a problematic restriction in
> cases where a temp file's ownership is not immediately handed over
> without concurrent sharing. What happens to be a small window for the
> parallel CREATE INDEX patch probably wouldn't be a small window for
> parallel hash join. :-(
>
> It's not hard to see why I would like to do things this way. Just look
> at ResourceOwnerReleaseInternal(). Any release of a file happens
> during RESOURCE_RELEASE_AFTER_LOCKS, whereas the release of dynamic
> shared memory segments happens earlier, during
> RESOURCE_RELEASE_BEFORE_LOCKS. ISTM that the only sensible way to
> implement a refcount is using dynamic shared memory, and that seems
> hard. There are additional reasons why I suggest we go this way, such
> as the fact that all the relevant state belongs to BufFile, which is
> implemented a layer above all of the guts of resource management of
> temp files within fd.c. I'd have to replicate almost all state in fd.c
> to make it all work, which seems like a big modularity violation.
>
> Does anyone have any suggestions on how to tackle this?

Hmm. One approach might be like this:
1. There is a shared refcount which is incremented when you open a shared file and decremented if you optionally explicitly 'release' it. (Not when you close it, because we can't allow code that may be run during RESOURCE_RELEASE_AFTER_LOCKS to try to access the DSM segment after it has been unmapped; more generally, creating destruction order dependencies between different kinds of resource-manager-cleaned-up objects seems like a bad idea. Of course the close code still looks after closing the vfds in the local backend.)

2. If you want to hand the file over to some other process and exit, you probably want to avoid race conditions or extra IPC burden. To achieve that you could 'pin' the file, so that it survives even while not open in any backend.

3. If the refcount reaches zero when you 'release' and the file isn't 'pinned', then you must delete the underlying files.

4. When the DSM segment is detached, we spin through all associated shared files that we're still 'attached' to (ie opened but didn't release) and decrement the refcount. If any shared file's refcount reaches zero its files should be deleted, even if it was 'pinned'.

In other words, the associated DSM segment's lifetime is the maximum lifetime of shared files, but it can be shorter if you 'release' in all backends and don't 'pin'. It's up to client code to come up with some scheme to make that work, if it doesn't take the easy route of pinning until DSM segment destruction.

I think in your case you'd simply pin all the BufFiles, allowing workers to exit when they're done; the leader would wait for all workers to indicate they'd finished, and then open the files. The files would be deleted eventually when the last process detaches from the DSM segment (very likely the leader). In my case I'd pin all shared BufFiles and then release them when I'd finished reading them back in and didn't need them anymore, and unpin them in the first participant to discover that the end had been reached (it would be a programming error to pin twice or unpin twice, like similarly named operations for DSM segments and DSA areas). That'd preserve the existing Hash Join behaviour of deleting batch files as soon as possible, but also guarantee cleanup in any error case.

There is something a bit unpleasant about teaching other subsystems about the existence of DSM segments just to be able to use DSM lifetime as a cleanup scope. I do think dsm_on_detach is a pretty good place to do cleanup of resources in parallel computing cases like ours, but I wonder if we could introduce a more generic destructor callback interface which DSM segments could provide.

--
Thomas Munro
http://www.enterprisedb.com
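As an interface, the open/release/pin protocol just sketched might look something like the following. Every name here is invented for illustration (SharedFileRegistry in particular); no such API exists:

/* Hypothetical API for the open/release/pin scheme described above. */
typedef struct SharedFileRegistry SharedFileRegistry;   /* lives in DSM */

/* Increments the shared refcount for this file. */
extern BufFile *shared_file_open(SharedFileRegistry *reg, int fileno);

/* Decrements the refcount; deletes underlying files at zero, unless pinned. */
extern void shared_file_release(SharedFileRegistry *reg, BufFile *file);

/* Keep the file alive even while no backend has it open, and undo that. */
extern void shared_file_pin(SharedFileRegistry *reg, int fileno);
extern void shared_file_unpin(SharedFileRegistry *reg, int fileno);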
On Wed, Feb 8, 2017 at 8:40 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> On Tue, Feb 7, 2017 at 5:43 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>> Does anyone have any suggestions on how to tackle this?
>
> Hmm. One approach might be like this:
>
> [hand-wavy stuff]

Thinking a bit harder about this, I suppose there could be a kind of object called a SharedBufFileManager (insert better name) which you can store in a DSM segment. The leader backend that initialises a DSM segment containing one of these would then call a constructor function that sets an internal refcount to 1 and registers an on_dsm_detach callback for its on-detach function. All worker backends that attach to the DSM segment would need to call an attach function for the SharedBufFileManager to increment a refcount and also register the on_dsm_detach callback, before any chance that an error might be thrown (is that difficult?); failure to do so could result in file leaks.

Then, when a BufFile is to be shared (AKA exported, made unifiable), a SharedBufFile object can be initialised somewhere in the same DSM segment and registered with the SharedBufFileManager. Internally all registered SharedBufFile objects would be linked together using offsets from the start of the DSM segment for link pointers. Now when SharedBufFileManager's on-detach function runs, it decrements the refcount in the SharedBufFileManager, and if that reaches zero then it runs a destructor that spins through the list of SharedBufFile objects deleting files that haven't already been deleted explicitly.

I retract the pin/unpin and per-file refcounting stuff I mentioned earlier. You could make the default that all files registered with a SharedBufFileManager survive until the containing DSM segment is detached everywhere using that single refcount in the SharedBufFileManager object, but also provide a 'no really delete this particular shared file now' operation for client code that knows it's safe to do that sooner (which would be the case for me, I think). I don't think per-file refcounts are needed.

There are a couple of problems with the above though. Firstly, doing reference counting in DSM segment on-detach hooks is really a way to figure out when the DSM segment is about to be destroyed by keeping a separate refcount in sync with the DSM segment's refcount, but it doesn't account for pinned DSM segments. It's not your use-case or mine currently, but someone might want a DSM segment to live even when it's not attached anywhere, to be reattached later. If we're trying to use DSM segment lifetime as a scope, we'd be ignoring this detail. Perhaps instead of adding our own refcount we need a new kind of hook, on_dsm_destroy.

Secondly, I might not want to be constrained by a fixed-sized DSM segment to hold my SharedBufFile objects... there are cases where I need to share a number of batch files that is unknown at the start of execution time, when the DSM segment is sized (I'll write about that shortly on the Parallel Shared Hash thread). Maybe I can find a way to get rid of that requirement. Or maybe it could support DSA memory too, but I don't think it's possible to use on_dsm_detach-based cleanup routines that refer to DSA memory because by the time any given DSM segment's detach hook runs, there's no telling which other DSM segments have been detached already, so the DSA area may already have partially vanished; some other kind of hook that runs earlier would be needed... Hmm.

--
Thomas Munro
http://www.enterprisedb.com
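A rough sketch of the shape just described might be as follows. Every name and field here is a placeholder invented for illustration, not patch code; only on_dsm_detach(), dsm_segment_address() and the spinlock primitives are existing infrastructure:

/* Placeholder sketch of the SharedBufFileManager idea above. */
typedef struct SharedBufFile
{
    Size        next;           /* segment offset of next entry, or 0 */
    pid_t       creator_pid;    /* with fileno, identifies the temp file */
    int         fileno;
    bool        deleted;        /* already unlinked explicitly? */
} SharedBufFile;

typedef struct SharedBufFileManager
{
    slock_t     mutex;
    int         refcount;       /* attached backends */
    Size        first;          /* segment offset of list head, or 0 */
} SharedBufFileManager;

/* Registered via on_dsm_detach() by every backend that attaches. */
static void
shared_buf_file_manager_on_detach(dsm_segment *seg, Datum arg)
{
    SharedBufFileManager *mgr = (SharedBufFileManager *) DatumGetPointer(arg);
    bool        last;

    SpinLockAcquire(&mgr->mutex);
    last = (--mgr->refcount == 0);
    SpinLockRelease(&mgr->mutex);

    if (last)
    {
        /* Last to detach: walk the offset-linked list and clean up. */
        char       *base = (char *) dsm_segment_address(seg);
        Size        off = mgr->first;

        while (off != 0)
        {
            SharedBufFile *file = (SharedBufFile *) (base + off);

            if (!file->deleted)
            {
                /* unlink file's on-disk segments here */
            }
            off = file->next;
        }
    }
}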
On Wed, Feb 8, 2017 at 5:36 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > Thinking a bit harder about this, I suppose there could be a kind of > object called a SharedBufFileManager (insert better name) which you > can store in a DSM segment. The leader backend that initialises a DSM > segment containing one of these would then call a constructor function > that sets an internal refcount to 1 and registers an on_dsm_detach > callback for its on-detach function. All worker backends that attach > to the DSM segment would need to call an attach function for the > SharedBufFileManager to increment a refcount and also register the > on_dsm_detach callback, before any chance that an error might be > thrown (is that difficult?); failure to do so could result in file > leaks. Then, when a BufFile is to be shared (AKA exported, made > unifiable), a SharedBufFile object can be initialised somewhere in the > same DSM segment and registered with the SharedBufFileManager. > Internally all registered SharedBufFile objects would be linked > together using offsets from the start of the DSM segment for link > pointers. Now when SharedBufFileManager's on-detach function runs, it > decrements the refcount in the SharedBufFileManager, and if that > reaches zero then it runs a destructor that spins through the list of > SharedBufFile objects deleting files that haven't already been deleted > explicitly. I think this is approximately reasonable, but I think it could be made simpler by having fewer separate objects. Let's assume the leader can put an upper bound on the number of shared BufFiles at the time it's sizing the DSM segment (i.e. before InitializeParallelDSM). Then it can allocate a big ol' array with a header indicating the array size and each element containing enough space to identify the relevant details of 1 shared BufFile. Now you don't need to do any allocations later on, and you don't need a linked list. You just loop over the array and do what needs doing. > There are a couple of problems with the above though. Firstly, doing > reference counting in DSM segment on-detach hooks is really a way to > figure out when the DSM segment is about to be destroyed by keeping a > separate refcount in sync with the DSM segment's refcount, but it > doesn't account for pinned DSM segments. It's not your use-case or > mine currently, but someone might want a DSM segment to live even when > it's not attached anywhere, to be reattached later. If we're trying > to use DSM segment lifetime as a scope, we'd be ignoring this detail. > Perhaps instead of adding our own refcount we need a new kind of hook > on_dsm_destroy. I think it's good enough to plan for current needs now. It's not impossible to change this stuff later, but we need something that works robustly right now without being too invasive. Inventing whole new system concepts because of stuff we might someday want to do isn't a good idea because we may easily guess wrong about what direction we'll want to go in the future. This is more like building a wrench than a 747: a 747 needs to be extensible and reconfigurable and upgradable because it costs $350 million. A wrench costs $10 at Walmart and if it turns out we bought the wrong one, we can just throw it out and get a different one later. > Secondly, I might not want to be constrained by a > fixed-sized DSM segment to hold my SharedBufFile objects... 
there are > cases where I need to shared a number of batch files that is unknown > at the start of execution time when the DSM segment is sized (I'll > write about that shortly on the Parallel Shared Hash thread). Maybe I > can find a way to get rid of that requirement. Or maybe it could > support DSA memory too, but I don't think it's possible to use > on_dsm_detach-based cleanup routines that refer to DSA memory because > by the time any given DSM segment's detach hook runs, there's no > telling which other DSM segments have been detached already, so the > DSA area may already have partially vanished; some other kind of hook > that runs earlier would be needed... Again, wrench. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
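For concreteness, the fixed-size array Robert sketches above might look like this -- all names invented here, with the leader sizing it before InitializeParallelDSM:

/* Invented sketch of the fixed-size array suggestion; not patch code. */
typedef struct SharedBufFileSlot
{
    bool        in_use;
    bool        deleted;        /* already unlinked explicitly? */
    pid_t       creator_pid;    /* with fileno, enough to rebuild the path */
    int         fileno;
} SharedBufFileSlot;

typedef struct SharedBufFileArray
{
    slock_t     mutex;
    int         refcount;       /* attached backends */
    int         nslots;         /* fixed when the DSM segment is sized */
    SharedBufFileSlot slots[FLEXIBLE_ARRAY_MEMBER];
} SharedBufFileArray;

/* Called by the leader while estimating DSM size. */
static Size
shared_buf_file_array_size(int nslots)
{
    return add_size(offsetof(SharedBufFileArray, slots),
                    mul_size(nslots, sizeof(SharedBufFileSlot)));
}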
On Fri, Feb 10, 2017 at 9:51 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Feb 8, 2017 at 5:36 AM, Thomas Munro > <thomas.munro@enterprisedb.com> wrote: >> Thinking a bit harder about this, I suppose there could be a kind of >> object called a SharedBufFileManager [... description of that ...]. > > I think this is approximately reasonable, but I think it could be made > simpler by having fewer separate objects. Let's assume the leader can > put an upper bound on the number of shared BufFiles at the time it's > sizing the DSM segment (i.e. before InitializeParallelDSM). Then it > can allocate a big ol' array with a header indicating the array size > and each element containing enough space to identify the relevant > details of 1 shared BufFile. Now you don't need to do any allocations > later on, and you don't need a linked list. You just loop over the > array and do what needs doing. Makes sense. >> There are a couple of problems with the above though. Firstly, doing >> reference counting in DSM segment on-detach hooks is really a way to >> figure out when the DSM segment is about to be destroyed by keeping a >> separate refcount in sync with the DSM segment's refcount, but it >> doesn't account for pinned DSM segments. It's not your use-case or >> mine currently, but someone might want a DSM segment to live even when >> it's not attached anywhere, to be reattached later. If we're trying >> to use DSM segment lifetime as a scope, we'd be ignoring this detail. >> Perhaps instead of adding our own refcount we need a new kind of hook >> on_dsm_destroy. > > I think it's good enough to plan for current needs now. It's not > impossible to change this stuff later, but we need something that > works robustly right now without being too invasive. Inventing whole > new system concepts because of stuff we might someday want to do isn't > a good idea because we may easily guess wrong about what direction > we'll want to go in the future. This is more like building a wrench > than a 747: a 747 needs to be extensible and reconfigurable and > upgradable because it costs $350 million. A wrench costs $10 at > Walmart and if it turns out we bought the wrong one, we can just throw > it out and get a different one later. I agree that the pinned segment case doesn't matter right now, I just wanted to point it out. I like your $10 wrench analogy, but maybe it could be argued that adding a dsm_on_destroy() callback mechanism is not only better than adding another refcount to track that other refcount, but also a steal at only $8. >> Secondly, I might not want to be constrained by a >> fixed-sized DSM segment to hold my SharedBufFile objects... there are >> cases where I need to shared a number of batch files that is unknown >> at the start of execution time when the DSM segment is sized (I'll >> write about that shortly on the Parallel Shared Hash thread). Maybe I >> can find a way to get rid of that requirement. Or maybe it could >> support DSA memory too, but I don't think it's possible to use >> on_dsm_detach-based cleanup routines that refer to DSA memory because >> by the time any given DSM segment's detach hook runs, there's no >> telling which other DSM segments have been detached already, so the >> DSA area may already have partially vanished; some other kind of hook >> that runs earlier would be needed... > > Again, wrench. My problem here is that I don't know how many batches I'll finish up creating. 
In general that's OK because I can hold onto them as private BufFiles owned by participants with the existing cleanup mechanism, and then share them just before they need to be shared (ie when we switch to processing the next batch so they need to be readable by all). Now I only ever share one inner and one outer batch file per participant at a time, and then I explicitly delete them at a time that I know to be safe and before I need to share a new file that would involve recycling the slot, and I'm relying on DSM segment scope cleanup only to handle error paths. That means that in general I only need space for 2 * P shared BufFiles at a time. But there is a problem case: when the leader needs to exit early, it needs to be able to transfer ownership of any files it has created, which could be more than we planned for, and then not participate any further in the hash join, so it can't participate in the on-demand sharing scheme.

Perhaps we can find a way to describe a variable number of BufFiles (ie batches) in a fixed space by constructing the filenames in a way that only requires us to say how many there are. Then the next problem is that for each BufFile we have to know how many 1GB segments there are to unlink (files named foo, foo.1, foo.2, ...), which Peter's code currently captures by publishing the file size in the descriptor... but if a fixed size object must describe N BufFiles, where can I put the size of each one? Maybe I could put it in a header of the file itself (yuck!), or maybe I could decide that I don't care what the size is, I'll simply unlink "foo", then "foo.1", then "foo.2", ... until I get ENOENT.

Alternatively I might get rid of the requirement for the leader to drop out of processing later batches. I'm about to post a message to the other thread about how to do that, but it's complicated and I'm currently working on the assumption that the PSH patch is useful without it (but let's not discuss that in this thread). That would have the side effect of getting rid of the requirement to share a number of BufFiles that isn't known up front.

--
Thomas Munro
http://www.enterprisedb.com
On Thu, Feb 9, 2017 at 5:09 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > I agree that the pinned segment case doesn't matter right now, I just > wanted to point it out. I like your $10 wrench analogy, but maybe it > could be argued that adding a dsm_on_destroy() callback mechanism is > not only better than adding another refcount to track that other > refcount, but also a steal at only $8. If it's that simple, it might be worth doing, but I bet it's not. One problem is that there's a race condition: there will inevitably be a period of time after you've called dsm_attach() and before you've attached to the specific data structure that we're talking about here. So suppose the last guy who actually knows about this data structure dies horribly and doesn't clean up because the DSM isn't being destroyed; moments later, you die horribly before reaching the code where you attach to this data structure. Oops. You might think about plugging that hole by moving the registry of on-destroy functions into the segment itself and making it a shared resource. But ASLR breaks that, especially for loadable modules. You could try to fix that problem, in turn, by storing arguments that can later be passed to load_external_function() instead of a function pointer per se. But that sounds pretty fragile because some other backend might not try to load the module until after it's attached the DSM segment and it might then fail because loading the module runs _PG_init() which can throw errors. Maybe you can think of a way to plug that hole too but you're waaaaay over your $8 budget by this point. >>> Secondly, I might not want to be constrained by a >>> fixed-sized DSM segment to hold my SharedBufFile objects... there are >>> cases where I need to shared a number of batch files that is unknown >>> at the start of execution time when the DSM segment is sized (I'll >>> write about that shortly on the Parallel Shared Hash thread). Maybe I >>> can find a way to get rid of that requirement. Or maybe it could >>> support DSA memory too, but I don't think it's possible to use >>> on_dsm_detach-based cleanup routines that refer to DSA memory because >>> by the time any given DSM segment's detach hook runs, there's no >>> telling which other DSM segments have been detached already, so the >>> DSA area may already have partially vanished; some other kind of hook >>> that runs earlier would be needed... >> >> Again, wrench. > > My problem here is that I don't know how many batches I'll finish up > creating. In general that's OK because I can hold onto them as > private BufFiles owned by participants with the existing cleanup > mechanism, and then share them just before they need to be shared (ie > when we switch to processing the next batch so they need to be > readable by all). Now I only ever share one inner and one outer batch > file per participant at a time, and then I explicitly delete them at a > time that I know to be safe and before I need to share a new file that > would involve recycling the slot, and I'm relying on DSM segment scope > cleanup only to handle error paths. That means that in generally I > only need space for 2 * P shared BufFiles at a time. But there is a > problem case: when the leader needs to exit early, it needs to be able > to transfer ownership of any files it has created, which could be more > than we planned for, and then not participate any further in the hash > join, so it can't participate in the on-demand sharing scheme. 
I thought the idea was that the structure we're talking about here owns all the files, up to 2 from a leader that wandered off plus up to 2 for each worker. Last process standing removes them. Or are you saying each worker only needs 2 files but the leader needs a potentially unbounded number? > Perhaps we can find a way to describe a variable number of BufFiles > (ie batches) in a fixed space by making sure the filenames are > constructed in a way that lets us just have to say how many there are. That could be done. > Then the next problem is that for each BufFile we have to know how > many 1GB segments there are to unlink (files named foo, foo.1, foo.2, > ...), which Peter's code currently captures by publishing the file > size in the descriptor... but if a fixed size object must describe N > BufFiles, where can I put the size of each one? Maybe I could put it > in a header of the file itself (yuck!), or maybe I could decide that I > don't care what the size is, I'll simply unlink "foo", then "foo.1", > then "foo.2", ... until I get ENOENT. There's nothing wrong with that algorithm as far as I'm concerned. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
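The "unlink until ENOENT" loop blessed here is simple enough to sketch. The path construction below is invented for illustration; real temp file segment naming lives in fd.c:

/*
 * Sketch of the cleanup loop described above: unlink numbered 1GB
 * segments until the first ENOENT, so no size needs to be published.
 */
static void
unlink_buffile_segments(const char *base_path)
{
    char        path[MAXPGPATH];
    int         segno = 0;

    for (;;)
    {
        if (segno == 0)
            snprintf(path, sizeof(path), "%s", base_path);
        else
            snprintf(path, sizeof(path), "%s.%d", base_path, segno);

        if (unlink(path) < 0)
        {
            /* ENOENT means no more segments; anything else, log and limp on. */
            if (errno != ENOENT)
                elog(LOG, "could not unlink file \"%s\": %m", path);
            break;
        }
        segno++;
    }
}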
On Fri, Feb 10, 2017 at 11:31 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Feb 9, 2017 at 5:09 PM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> I agree that the pinned segment case doesn't matter right now, I just
>> wanted to point it out. I like your $10 wrench analogy, but maybe it
>> could be argued that adding a dsm_on_destroy() callback mechanism is
>> not only better than adding another refcount to track that other
>> refcount, but also a steal at only $8.
>
> If it's that simple, it might be worth doing, but I bet it's not. One
> problem is that there's a race condition: there will inevitably be a
> period of time after you've called dsm_attach() and before you've
> attached to the specific data structure that we're talking about here.
> So suppose the last guy who actually knows about this data structure
> dies horribly and doesn't clean up because the DSM isn't being
> destroyed; moments later, you die horribly before reaching the code
> where you attach to this data structure. Oops.

Right, I mentioned this problem earlier ("and also register the on_dsm_detach callback, before any chance that an error might be thrown (is that difficult?); failure to do so could result in file leaks").

Here's my thought process... please tell me where I'm going wrong:

I have been assuming that it's not enough to just deal with this when the leader detaches, on the theory that other participants will always detach first: that probably isn't true in some error cases, and could contribute to spurious racy errors where other workers complain about disappearing files if the leader somehow shuts down and cleans up while a worker is still running. Therefore we need *some* kind of refcounting, whether it's a new kind or a new mechanism based on the existing kind.

I have also been assuming that we don't want to teach dsm.c directly about this stuff; it shouldn't need to know about other modules, so we don't want it talking to buffile.c directly and managing a special table of files; instead we want a system of callbacks. Therefore client code needs to do something after attaching to the segment in each backend.

It doesn't matter whether we use an on_dsm_detach() callback and manage our own refcount to infer that destruction is imminent, or a new on_dsm_destroy() callback which tells us so explicitly: both ways we'll need to make sure that anyone who attaches to the segment also "attaches" to this shared BufFile manager object inside it, because any backend might turn out to be the one that is last to detach.

That brings us to the race you mentioned. Isn't it sufficient to say that you aren't allowed to do anything that might throw in between attaching to the segment and attaching to the SharedBufFileManager that it contains?

Up until two minutes ago I assumed that policy would leave only two possibilities: you attach to the DSM segment and attach to the SharedBufFileManager successfully, or you attach to the DSM segment and then die horribly (but not throw) and the postmaster restarts the whole cluster and blows all temp files away with RemovePgTempFiles(). But I see now in the comment of that function that crash-induced restarts don't call that because "someone might want to examine the temp files for debugging purposes". Given that policy for regular private BufFiles, I don't see why that shouldn't apply equally to shared files: after a crash restart, you may have some junk files that won't be cleaned up until your next clean restart, whether they were private or shared BufFiles.
> You might think about plugging that hole by moving the registry of
> on-destroy functions into the segment itself and making it a shared
> resource. But ASLR breaks that, especially for loadable modules. You
> could try to fix that problem, in turn, by storing arguments that can
> later be passed to load_external_function() instead of a function
> pointer per se. But that sounds pretty fragile because some other
> backend might not try to load the module until after it's attached the
> DSM segment and it might then fail because loading the module runs
> _PG_init() which can throw errors. Maybe you can think of a way to
> plug that hole too but you're waaaaay over your $8 budget by this
> point.

Agreed, those approaches seem like non-starters.

>> My problem here is that I don't know how many batches I'll finish up
>> creating. [...]
>
> I thought the idea was that the structure we're talking about here
> owns all the files, up to 2 from a leader that wandered off plus up to
> 2 for each worker. Last process standing removes them. Or are you
> saying each worker only needs 2 files but the leader needs a
> potentially unbounded number?

Yes, potentially unbounded in rare cases. If we plan for N batches, and then run out of work_mem because our estimates were just wrong or the distribution of keys is sufficiently skewed, we'll run HashIncreaseNumBatches, and that could happen more than once. I have a suite of contrived test queries that hits all the various modes and code paths of hash join, and it includes a query that plans for one batch but finishes up creating many, and then the leader exits. I'll post that to the other thread along with my latest patch series soon.

>> Perhaps we can find a way to describe a variable number of BufFiles
>> (ie batches) in a fixed space by making sure the filenames are
>> constructed in a way that lets us just have to say how many there are.
>
> That could be done.

Cool.

>> Then the next problem is that for each BufFile we have to know how
>> many 1GB segments there are to unlink (files named foo, foo.1, foo.2,
>> ...), which Peter's code currently captures by publishing the file
>> size in the descriptor... but if a fixed size object must describe N
>> BufFiles, where can I put the size of each one? Maybe I could put it
>> in a header of the file itself (yuck!), or maybe I could decide that I
>> don't care what the size is, I'll simply unlink "foo", then "foo.1",
>> then "foo.2", ... until I get ENOENT.
>
> There's nothing wrong with that algorithm as far as I'm concerned.

Cool.

--
Thomas Munro
http://www.enterprisedb.com
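The ordering constraint under discussion -- nothing that can throw between the two attach steps -- can be seen in a worker-side sketch like this. The magic number, TOC key and shared_buf_file_manager_attach() are invented names; dsm_attach(), shm_toc_attach() and shm_toc_lookup() are existing infrastructure:

/* Sketch of the worker-side attach ordering implied above. */
static SharedBufFileManager *
worker_attach_to_manager(dsm_handle handle)
{
    dsm_segment *seg = dsm_attach(handle);
    shm_toc    *toc = shm_toc_attach(MY_MAGIC, dsm_segment_address(seg));
    SharedBufFileManager *mgr;

    /* Nothing that can throw may run between here... */
    mgr = shm_toc_lookup(toc, KEY_SHARED_BUFFILE_MANAGER);
    shared_buf_file_manager_attach(mgr, seg);   /* refcount++, on_dsm_detach */
    /* ...and here; afterwards, throwing is safe again. */

    return mgr;
}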
On Thu, Feb 9, 2017 at 2:31 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> You might think about plugging that hole by moving the registry of
> on-destroy functions into the segment itself and making it a shared
> resource. But ASLR breaks that, especially for loadable modules. You
> could try to fix that problem, in turn, by storing arguments that can
> later be passed to load_external_function() instead of a function
> pointer per se. But that sounds pretty fragile because some other
> backend might not try to load the module until after it's attached the
> DSM segment and it might then fail because loading the module runs
> _PG_init() which can throw errors. Maybe you can think of a way to
> plug that hole too but you're waaaaay over your $8 budget by this
> point.

At the risk of stating the obvious, ISTM that the right way to do this, at a high level, is to err on the side of unneeded extra unlink() calls, not leaking files. And, to make the window for problems ("remaining hole that you haven't quite managed to plug") practically indistinguishable from no hole at all, in a way that's kind of baked into the API. It's not like we currently throw an error when there is a problem with deleting temp files that are no longer needed on resource manager cleanup. We simply log the fact that it happened, and limp on.

I attach my V8. This does not yet do anything with on_dsm_detach(). I've run out of time to work on it this week, and am starting a new job next week at VMware, which I'll need time to settle into. So I'm posting this now, since you can still very much see the direction I'm going in, and can give me any feedback that you have. If anyone wants to show me how it's done by building on this, and finishing what I have off, be my guest. The new stuff probably isn't quite as polished as I would prefer, but time grows short, so I won't withhold it.

Changes:

* Implements the refcount thing, albeit in a way that leaves a small window for double unlink() calls if there is an error during the small window in which there is worker/leader co-ownership of a BufFile (just add an "elog(ERROR)" just before leader-as-worker Tuplesort state is ended within _bt_leafbuild() to see what I mean). This implies that background workers can be reclaimed once the leader needs to start its final on-the-fly merge, which is nice. As an example of how that's nice, this change makes maintenance_work_mem a budget that we more strictly adhere to.

* Fixes bitrot caused by recent logtape.c bugfix in master branch.

* No local segment is created during unification unless and until one is required. (In practice, for current use of BufFile infrastructure, no "local" segment is ever created, even if we force a randomAccess case using one of the testing GUCs from 0002-* -- we'd have to use another GUC to *also* force there to be no reclamation.)

* Better testing. As I just mentioned, we can now force logtape.c to not reclaim blocks, so that new local segments are created as part of a unified BufFile; these have different considerations from a resource management point of view. Despite being part of the same "unified" BufFile from the leader's perspective, such a segment behaves like a local segment, so it definitely seems like a good idea to have test coverage for this, at least during development. (I have a pretty rough test suite that I'm using; development of this patch has been somewhat test driven.)

* Better encapsulation of BufFile stuff.
I am even closer to the ideal of this whole sharing mechanism being a fairly generic BufFile thing that logtape.c piggy-backs on without having special knowledge of the mechanism. It's still true that the mechanism (sharing/unification) is written principally with logtape.c in mind, but that's just because of its performance characteristics. Nothing to do with the interface. * Worked through items raised by Thomas in his 2017-01-30 mail to this thread. >>>> Secondly, I might not want to be constrained by a >>>> fixed-sized DSM segment to hold my SharedBufFile objects... there are >>>> cases where I need to shared a number of batch files that is unknown >>>> at the start of execution time when the DSM segment is sized (I'll >>>> write about that shortly on the Parallel Shared Hash thread). Maybe I >>>> can find a way to get rid of that requirement. Or maybe it could >>>> support DSA memory too, but I don't think it's possible to use >>>> on_dsm_detach-based cleanup routines that refer to DSA memory because >>>> by the time any given DSM segment's detach hook runs, there's no >>>> telling which other DSM segments have been detached already, so the >>>> DSA area may already have partially vanished; some other kind of hook >>>> that runs earlier would be needed... >>> >>> Again, wrench. I like the wrench analogy too, FWIW. >> My problem here is that I don't know how many batches I'll finish up >> creating. In general that's OK because I can hold onto them as >> private BufFiles owned by participants with the existing cleanup >> mechanism, and then share them just before they need to be shared (ie >> when we switch to processing the next batch so they need to be >> readable by all). Now I only ever share one inner and one outer batch >> file per participant at a time, and then I explicitly delete them at a >> time that I know to be safe and before I need to share a new file that >> would involve recycling the slot, and I'm relying on DSM segment scope >> cleanup only to handle error paths. That means that in generally I >> only need space for 2 * P shared BufFiles at a time. But there is a >> problem case: when the leader needs to exit early, it needs to be able >> to transfer ownership of any files it has created, which could be more >> than we planned for, and then not participate any further in the hash >> join, so it can't participate in the on-demand sharing scheme. I think that parallel CREATE INDEX can easily live with the restriction that we need to know how many shared BufFiles are needed up front. It will either be 1, or 2 (when there are 2 nbtsort.c spools, for unique index builds). We can also detect when the limit is already exceeded early, and back out, just as we do when there are no parallel workers currently available. >> Then the next problem is that for each BufFile we have to know how >> many 1GB segments there are to unlink (files named foo, foo.1, foo.2, >> ...), which Peter's code currently captures by publishing the file >> size in the descriptor... but if a fixed size object must describe N >> BufFiles, where can I put the size of each one? Maybe I could put it >> in a header of the file itself (yuck!), or maybe I could decide that I >> don't care what the size is, I'll simply unlink "foo", then "foo.1", >> then "foo.2", ... until I get ENOENT. > > There's nothing wrong with that algorithm as far as I'm concerned. 
I would like to point out, just to be completely clear, that while this V8 doesn't "do refcounts properly" (it doesn't use an on_dsm_detach() hook and so on), the only benefit that that would actually have for parallel CREATE INDEX is that it makes it impossible that the user could see a spurious ENOENT related log message during unlink() (I err on the side of doing too much unlinking, not too little). Which is very unlikely anyway. So, if that's okay for parallel hash join, as indicated by Robert here, an issue like that would presumably also be okay for parallel CREATE INDEX. It then follows that what I'm missing here is something that is only really needed for the parallel hash join patch anyway.

I really want to help Thomas, and am not shirking what I feel is a responsibility to assist him. I have every intention of breaking this down to produce a usable patch that only has the BufFile + resource management stuff, that follows the interface he sketched as a requirement for me in his most recent revision of his patch series ("0009-hj-shared-buffile-strawman-v4.patch"). I'm just pointing out that my patch is reasonably complete as a standalone piece of work right now, AFAICT.

--
Peter Geoghegan
On Thu, Feb 9, 2017 at 6:38 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > Here's my thought process... please tell me where I'm going wrong: > > I have been assuming that it's not enough to just deal with this when > the leader detaches on the theory that other participants will always > detach first: that probably isn't true in some error cases, and could > contribute to spurious racy errors where other workers complain about > disappearing files if the leader somehow shuts down and cleans up > while a worker is still running. Therefore we need *some* kind of > refcounting, whether it's a new kind or a new mechanism based on the > existing kind. +1. > I have also been assuming that we don't want to teach dsm.c directly > about this stuff; it shouldn't need to know about other modules, so we > don't want it talking to buffile.c directly and managing a special > table of files; instead we want a system of callbacks. Therefore > client code needs to do something after attaching to the segment in > each backend. +1. > It doesn't matter whether we use an on_dsm_detach() callback and > manage our own refcount to infer that destruction is imminent, or a > new on_dsm_destroy() callback which tells us so explicitly: both ways > we'll need to make sure that anyone who attaches to the segment also > "attaches" to this shared BufFile manager object inside it, because > any backend might turn out to be the one that is last to detach. Not entirely. In the first case, you don't need the requirement that everyone who attaches the segment must attach to the shared BufFile manager. In the second case, you do. > That bring us to the race you mentioned. Isn't it sufficient to say > that you aren't allowed to do anything that might throw in between > attaching to the segment and attaching to the SharedBufFileManager > that it contains? That would be sufficient, but I think it's not a very good design. It means, for example, that nothing between the time you attach to the segment and the time you attach to this manager can palloc() anything. So, for example, it would have to happen before ParallelWorkerMain reaches the call to shm_mq_attach, which kinda sucks because we want to do that as soon as possible after attaching to the DSM segment so that errors are reported properly thereafter. Note that's the very first thing we do now, except for working out what the arguments to that call need to be. Also, while it's currently safe to assume that shm_toc_attach() and shm_toc_lookup() don't throw errors, I've thought about the possibility of installing some sort of cache in shm_toc_lookup() to amortize the cost of lookups, if the number of keys ever got too large. And that would then require a palloc(). Generally, backend code should be free to throw errors. When it's absolutely necessary for a short segment of code to avoid that, then we do, but you can't really rely on any substantial amount of code to be that way, or stay that way. And in this case, even if we didn't mind those problems or had some solution to them, I think that the shared buffer manager shouldn't have to be something that is whacked directly into parallel.c all the way at the beginning of the initialization sequence so that nothing can fail before it happens. I think it should be an optional data structure that clients of the parallel infrastructure can decide to use, or to not use. 
It should be at arm's length from the core code, just like the way ParallelQueryMain() is distinct from ParallelWorkerMain() and sets up its own set of data structures with their own set of keys. All that stuff is happy to happen after whatever ParallelWorkerMain() feels that it needs to do, even if ParallelWorkerMain might throw errors for any number of unknown reasons. Similarly, I think this new thing should be something that an executor node can decide to create inside its own per-node space -- reserved via ExecParallelEstimate, initialized via ExecParallelInitializeDSM, etc. There's no need for it to be deeply coupled to parallel.c itself unless we force that choice by sticking a no-fail requirement in there.

> Up until two minutes ago I assumed that policy would leave only two
> possibilities: you attach to the DSM segment and attach to the
> SharedBufFileManager successfully or you attach to the DSM segment and
> then die horribly (but not throw) and the postmaster restarts the
> whole cluster and blows all temp files away with RemovePgTempFiles().
> But I see now in the comment of that function that crash-induced
> restarts don't call that because "someone might want to examine the
> temp files for debugging purposes". Given that policy for regular
> private BufFiles, I don't see why that shouldn't apply equally to
> shared files: after a crash restart, you may have some junk files that
> won't be cleaned up until your next clean restart, whether they were
> private or shared BufFiles.

I think most people (other than Tom) would agree that that policy isn't really sensible any more; it probably made sense when the PostgreSQL user community was much smaller and consisted mostly of the people developing PostgreSQL, but these days it's much more likely to cause operational headaches than to help a developer debug.

Regardless, I think the primary danger isn't failure to remove a file (although that is best avoided) but removing one too soon (causing someone else to error when opening it, or on Windows causing the delete itself to error out). It's not really OK for random stuff to throw errors in corner cases because we were too lazy to ensure that cleanup operations happen in the right order.

>> I thought the idea was that the structure we're talking about here
>> owns all the files, up to 2 from a leader that wandered off plus up to
>> 2 for each worker. Last process standing removes them. Or are you
>> saying each worker only needs 2 files but the leader needs a
>> potentially unbounded number?
>
> Yes, potentially unbounded in rare case. If we plan for N batches,
> and then run out of work_mem because our estimates were just wrong or
> the distributions of keys is sufficiently skewed, we'll run
> HashIncreaseNumBatches, and that could happen more than once. I have
> a suite of contrived test queries that hits all the various modes and
> code paths of hash join, and it includes a query that plans for one
> batch but finishes up creating many, and then the leader exits. I'll
> post that to the other thread along with my latest patch series soon.

Hmm, OK. So that's going to probably require something where a fixed amount of DSM can describe an arbitrary number of temp file series. But that also means this is an even-more-special-purpose tool that shouldn't be deeply tied into parallel.c so that it can run before any errors happen.
Basically, I think the "let's write the code between here and here so it throws no errors" technique is, for 99% of PostgreSQL programming, difficult and fragile. We shouldn't rely on it if there is some other reasonable option. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
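The per-node arrangement Robert describes follows the existing estimate/initialize pattern. Here is a hedged sketch: MyNodeState, its fields, and the sizing helper are invented, while the shm_toc calls and the estimator/toc members of ParallelContext are existing infrastructure:

/* Sketch of per-node DSM reservation, as described above. */
void
ExecMyNodeEstimate(MyNodeState *node, ParallelContext *pcxt)
{
    /* e.g., two files per worker plus two for a leader that wanders off */
    node->shared_size = shared_buf_file_array_size(2 * pcxt->nworkers + 2);
    shm_toc_estimate_chunk(&pcxt->estimator, node->shared_size);
    shm_toc_estimate_keys(&pcxt->estimator, 1);
}

void
ExecMyNodeInitializeDSM(MyNodeState *node, ParallelContext *pcxt)
{
    SharedBufFileArray *array;

    array = (SharedBufFileArray *) shm_toc_allocate(pcxt->toc, node->shared_size);
    /* set refcount to 1, register the leader's on_dsm_detach callback, ... */
    shm_toc_insert(pcxt->toc, node->plan_node_id, array);
}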
On Sat, Feb 4, 2017 at 2:45 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> It might just have been that the table was too small to be an
> effective target for parallel sequential scan with so many workers,
> and so a presorted best case CREATE INDEX, which isn't that different,
> also fails to see much benefit (compared to what you'd see with a
> similar case involving a larger table). In other words, I might have
> jumped the gun in emphasizing issues with hardware and I/O bandwidth
> over issues around data volume (that I/O parallelism is inherently not
> very helpful with these relatively small tables).
>
> As I've pointed out a couple of times before, bigger sorts will be
> more CPU bound because sorting itself has costs that grow
> linearithmically, whereas writing out runs has costs that grow
> linearly. The relative cost of the I/O can be expected to go down as
> input goes up for this reason. At the same time, a larger input might
> make better use of I/O parallelism, which reduces the cost paid in
> latency to write out runs in absolute terms.

Here are some results with your latest patch, using the same test as before but this time with SCALE=100 (= 100,000,000 rows). The table sizes are:

                          List of relations
 Schema |         Name         | Type  |    Owner     | Size  | Description
--------+----------------------+-------+--------------+-------+-------------
 public | million_words        | table | thomas.munro | 42 MB |
 public | some_words           | table | thomas.munro | 19 MB |
 public | test_intwide_u_asc   | table | thomas.munro | 18 GB |
 public | test_intwide_u_desc  | table | thomas.munro | 18 GB |
 public | test_intwide_u_rand  | table | thomas.munro | 18 GB |
 public | test_textwide_u_asc  | table | thomas.munro | 19 GB |
 public | test_textwide_u_desc | table | thomas.munro | 19 GB |
 public | test_textwide_u_rand | table | thomas.munro | 19 GB |

To reduce the number of combinations I did only unique data and built only non-unique indexes with only 'wide' tuples (= key plus a text column that holds a 151-character wide string, rather than just the key), and also didn't bother with the 1MB memory size as suggested. Here are the results up to 4 workers (a results table going up to 8 workers is attached, since it wouldn't format nicely if I pasted it here). Again, the w = 0 time is seconds, the rest show relative speed-up. This data was all in the OS page cache because of a dummy run done first, and I verified with 'sar' that there was exactly 0 reading from the block device. The CPU was pegged on leader + workers during sort runs, and then the leader's CPU hovered around 93-98% during the merge/btree build. I had some technical problems getting a cold-cache read-from-actual-disk-each-time test run to work properly, but can go back and do that again if anyone thinks that would be interesting data to see.
   tab    | ord  | mem |  w = 0  | w = 1 | w = 2 | w = 3 | w = 4
----------+------+-----+---------+-------+-------+-------+-------
 intwide  | asc  |  64 |   67.91 | 1.26x | 1.46x | 1.62x | 1.73x
 intwide  | asc  | 256 |   67.84 | 1.23x | 1.48x | 1.63x | 1.79x
 intwide  | asc  | 512 |   69.01 | 1.25x | 1.50x | 1.63x | 1.80x
 intwide  | desc |  64 |   98.08 | 1.48x | 1.83x | 2.03x | 2.25x
 intwide  | desc | 256 |   99.87 | 1.43x | 1.80x | 2.03x | 2.29x
 intwide  | desc | 512 |  104.09 | 1.44x | 1.85x | 2.09x | 2.33x
 intwide  | rand |  64 |  138.03 | 1.56x | 2.04x | 2.42x | 2.58x
 intwide  | rand | 256 |  139.44 | 1.61x | 2.04x | 2.38x | 2.56x
 intwide  | rand | 512 |  138.96 | 1.52x | 2.03x | 2.28x | 2.57x
 textwide | asc  |  64 |  207.10 | 1.20x | 1.07x | 1.09x | 1.11x
 textwide | asc  | 256 |  200.62 | 1.19x | 1.06x | 1.04x | 0.99x
 textwide | asc  | 512 |  191.42 | 1.16x | 0.97x | 1.01x | 0.94x
 textwide | desc |  64 | 1382.48 | 1.89x | 2.37x | 3.18x | 3.87x
 textwide | desc | 256 | 1427.99 | 1.89x | 2.42x | 3.24x | 4.00x
 textwide | desc | 512 | 1453.21 | 1.86x | 2.39x | 3.23x | 3.75x
 textwide | rand |  64 | 1587.28 | 1.89x | 2.37x | 2.66x | 2.75x
 textwide | rand | 256 | 1557.90 | 1.85x | 2.34x | 2.64x | 2.73x
 textwide | rand | 512 | 1547.97 | 1.87x | 2.32x | 2.64x | 2.71x

"textwide" "asc" is nearly an order of magnitude faster than other initial orders without parallelism, but then parallelism doesn't seem to help it much. Also, using more than 64MB doesn't ever seem to help very much; in the "desc" case it hinders.

I was curious to understand how performance changes if we become just a bit less correlated (rather than completely uncorrelated or perfectly inversely correlated), so I tried out a 'banana skin' case: I took the contents of the textwide asc table and copied it to a new table, and then moved the 900 words matching 'banana%' to the physical end of the heap by deleting and reinserting them in one transaction. I guess if we were to use this technology for CLUSTER, this might be representative of a situation where you regularly recluster a growing table. The results were pretty much like "asc":

   tab    |  ord   | mem | w = 0  | w = 1 | w = 2 | w = 3 | w = 4
----------+--------+-----+--------+-------+-------+-------+-------
 textwide | banana |  64 | 213.39 | 1.17x | 1.11x | 1.15x | 1.09x

It's hard to speculate about this, but I guess that a significant number of indexes in real world databases might be uncorrelated to insert order. A newly imported or insert-only table might have one highly correlated index for a surrogate primary key or time column, but other indexes might tend to be uncorrelated. But really, who knows... in a kind of textbook perfectly correlated case such as a time series table with an append-only time or sequence based key, you might want to use BRIN rather than B-Tree anyway.

--
Thomas Munro
http://www.enterprisedb.com
On Thu, Feb 9, 2017 at 7:10 PM, Peter Geoghegan <pg@bowt.ie> wrote: > At the risk of stating the obvious, ISTM that the right way to do > this, at a high level, is to err on the side of unneeded extra > unlink() calls, not leaking files. And, to make the window for problem > ("remaining hole that you haven't quite managed to plug") practically > indistinguishable from no hole at all, in a way that's kind of baked > into the API. I do not think there should be any reason why we can't get the resource accounting exactly correct here. If a single backend manages to remove every temporary file that it creates exactly once (and that's currently true, modulo system crashes), a group of cooperating backends ought to be able to manage to remove every temporary file that any of them create exactly once (again, modulo system crashes). I do agree that a duplicate unlink() call isn't as bad as a missing unlink() call, at least if there's no possibility that the filename could have been reused by some other process, or some other part of our own process, which doesn't want that new file unlinked. But it's messy. If the seatbelts in your car were to randomly unbuckle, that would be a safety hazard. If they were to randomly refuse to unbuckle, you wouldn't say "that's OK because it's not a safety hazard", you'd say "these seatbelts are badly designed". And I think the same is true of this mechanism. The way to make this 100% reliable is to set things up so that there is joint ownership from the beginning and shared state that lets you know whether the work has already been done. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Feb 16, 2017 at 6:28 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Feb 9, 2017 at 7:10 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>> At the risk of stating the obvious, ISTM that the right way to do
>> this, at a high level, is to err on the side of unneeded extra
>> unlink() calls, not leaking files. And, to make the window for problem
>> ("remaining hole that you haven't quite managed to plug") practically
>> indistinguishable from no hole at all, in a way that's kind of baked
>> into the API.
>
> I do not think there should be any reason why we can't get the
> resource accounting exactly correct here. If a single backend manages
> to remove every temporary file that it creates exactly once (and
> that's currently true, modulo system crashes), a group of cooperating
> backends ought to be able to manage to remove every temporary file
> that any of them create exactly once (again, modulo system crashes).

I believe that we are fully in agreement here. In particular, I think it's bad that there is an API that says "caller shouldn't throw an elog error between these two points", and that will be fixed before too long. I just think that it's worth acknowledging a certain nuance.

> I do agree that a duplicate unlink() call isn't as bad as a missing
> unlink() call, at least if there's no possibility that the filename
> could have been reused by some other process, or some other part of
> our own process, which doesn't want that new file unlinked. But it's
> messy. If the seatbelts in your car were to randomly unbuckle, that
> would be a safety hazard. If they were to randomly refuse to
> unbuckle, you wouldn't say "that's OK because it's not a safety
> hazard", you'd say "these seatbelts are badly designed". And I think
> the same is true of this mechanism.

If it happened in the lifetime of only one out of a million seatbelts manufactured, and they were manufactured at a competitive price (not over-engineered), I probably wouldn't say that. The fact that the existing resource manager code only LOGs most temp file related failures suggests to me that that's a "can't happen" condition, but we still hedge. I would still like to hedge against even (theoretically) impossible risks.

Maybe I'm just being pedantic here, since we both actually want the code to do the same thing.

--
Peter Geoghegan
On Wed, Feb 15, 2017 at 6:05 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> Here are some results with your latest patch, using the same test as
> before but this time with SCALE=100 (= 100,000,000 rows).

Cool.

> To reduce the number of combinations I did only unique data and built
> only non-unique indexes with only 'wide' tuples (= key plus a text
> column that holds a 151-character wide string, rather than just the
> key), and also didn't bother with the 1MB memory size as suggested.
> Here are the results up to 4 workers (a results table going up to 8
> workers is attached, since it wouldn't format nicely if I pasted it
> here).

I think that you are still I/O bound in a way that is addressable by adding more disks. The exception is the text cases, where the patch does best. (I don't place too much emphasis on that because I know that in the long term, we'll have abbreviated keys, which will take some of the sheen off of that.)

> Again, the w = 0 time is seconds, the rest show relative
> speed-up.

I think it's worth pointing out that while there are cases where we see no benefit from going from 4 to 8 workers, it tends to hardly hurt at all, or hardly help at all. It's almost irrelevant that the number of workers used is excessive, at least up until the point when all cores have their own worker. That's a nice quality for this to have -- the only danger is that we use parallelism when we shouldn't have at all, because the serial case could manage an internal sort, and the sort was small enough that that could be a notable factor.

> "textwide" "asc" is nearly an order of magnitude faster than other
> initial orders without parallelism, but then parallelism doesn't seem
> to help it much. Also, using more than 64MB doesn't ever seem to help
> very much; in the "desc" case it hinders.

Maybe it's CPU cache efficiency? There are edge cases where multiple passes are faster than one pass. That's the only explanation I can think of.

> I was curious to understand how performance changes if we become just
> a bit less correlated (rather than completely uncorrelated or
> perfectly inversely correlated), so I tried out a 'banana skin' case:
> I took the contents of the textwide asc table and copied it to a new
> table, and then moved the 900 words matching 'banana%' to the physical
> end of the heap by deleting and reinserting them in one transaction.

A likely problem with that is that most runs will actually not have their own banana skin, so to speak. You only see a big drop in performance when every quicksort operation has presorted input, but with one or more out-of-order tuples at the end.

In order to see a really unfortunate case with parallel CREATE INDEX, you'd probably have to have enough memory that workers don't need to do their own merge (and so a worker's work almost entirely consists of one big quicksort operation), with enough "banana skin heap pages" that the parallel heap scan is pretty much guaranteed to end up giving "banana skin" (out of order) tuples to every worker, making all of them "have a slip" (throw away a huge amount of work as the presorted optimization is defeated right at the end of its sequential read through). A better approach would be to have several small localized areas across the input where tuples are a little out of order. That would probably show that the performance is pretty much in line with random cases.
> It's hard to speculate about this, but I guess that a significant
> number of indexes in real world databases might be uncorrelated to
> insert order.

That would certainly be true with text, where we see a risk of (small)
regressions.

--
Peter Geoghegan
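To make the "banana skin" cost concrete, here is a small standalone C
analogue (an editorial sketch, not code from the patch): a
presorted-input check, like the one a quicksort implementation may run
before sorting, only detects a single trailing out-of-order element
after scanning nearly the entire input, and all of that scanning is
then thrown away before the real sort begins.

#include <stdbool.h>
#include <stdio.h>

/*
 * Analogue of a presorted-input check: scan until the first inversion.
 * With a "banana skin" (one out-of-order element near the end), the
 * check fails only after inspecting almost the whole array.
 */
static bool
check_presorted(const int *a, int n)
{
    for (int i = 1; i < n; i++)
        if (a[i] < a[i - 1])
            return false;
    return true;
}

int
main(void)
{
    int         a[1000];

    for (int i = 0; i < 1000; i++)
        a[i] = i;
    a[999] = -1;                /* the banana skin */

    /* Fails, but only after ~999 comparisons that are then wasted. */
    printf("presorted: %d\n", (int) check_presorted(a, 1000));
    return 0;
}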
On Thu, Feb 16, 2017 at 11:45 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> Maybe I'm just being pedantic here, since we both actually want the
> code to do the same thing.

Pedantry from either of us? Nah...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, Feb 11, 2017 at 1:52 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Feb 9, 2017 at 6:38 PM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> Yes, potentially unbounded in rare cases. If we plan for N batches,
>> and then run out of work_mem because our estimates were just wrong or
>> the distribution of keys is sufficiently skewed, we'll run
>> HashIncreaseNumBatches, and that could happen more than once. I have
>> a suite of contrived test queries that hits all the various modes and
>> code paths of hash join, and it includes a query that plans for one
>> batch but finishes up creating many, and then the leader exits. I'll
>> post that to the other thread along with my latest patch series soon.
>
> Hmm, OK. So that's going to probably require something where a fixed
> amount of DSM can describe an arbitrary number of temp file series.
> But that also means this is an even-more-special-purpose tool that
> shouldn't be deeply tied into parallel.c so that it can run before any
> errors happen.
>
> Basically, I think the "let's write the code between here and here so
> it throws no errors" technique is, for 99% of PostgreSQL programming,
> difficult and fragile. We shouldn't rely on it if there is some other
> reasonable option.

I'm testing a patch that lets you set up a fixed-size SharedBufFileSet
object in a DSM segment, with its own refcount for the reason you
explained. It supports a dynamically expandable set of numbered files,
so each participant gets to export file 0, file 1, file 2 and so on as
required, in any order. I think this should suit both Parallel
Tuplesort, which needs to export just one file from each participant,
and Parallel Shared Hash, which doesn't know up front how many batches
it will produce. Not quite ready, but I will post a version tomorrow
to get Peter's reaction.

--
Thomas Munro
http://www.enterprisedb.com
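A rough sketch of the interface shape Thomas describes might look like
the following. The struct layout and function names here are
illustrative guesses, not the actual patch: the essential properties
are a fixed-size control object in DSM, a refcount for
last-to-detach-cleans-up semantics, and per-participant file numbers
that can grow without consuming any additional shared memory.

#include "postgres.h"
#include "storage/buffile.h"
#include "storage/dsm.h"
#include "storage/spin.h"

/*
 * Illustrative sketch only: a fixed-size control object that lives in
 * a DSM segment.  Files themselves are discovered on disk by
 * constructed name, so the set of numbered files per participant can
 * expand dynamically with no new shmem.
 */
typedef struct SharedBufFileSet
{
    slock_t     mutex;          /* protects refcount */
    int         refcount;       /* attached participants */
    pid_t       creator_pid;    /* used to build unique on-disk names */
    uint32      set_id;         /* distinguishes concurrent sets */
} SharedBufFileSet;

/* Each participant may export file 0, 1, 2, ... in any order. */
extern void SharedBufFileSetInit(SharedBufFileSet *set, dsm_segment *seg);
extern void SharedBufFileSetAttach(SharedBufFileSet *set, dsm_segment *seg);
extern BufFile *SharedBufFileCreate(SharedBufFileSet *set,
                                    int participant, int file_no);
extern BufFile *SharedBufFileImport(SharedBufFileSet *set,
                                    int participant, int file_no);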
On Wed, Mar 1, 2017 at 10:29 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> I'm testing a patch that lets you set up a fixed-size SharedBufFileSet
> object in a DSM segment, with its own refcount for the reason you
> explained. It supports a dynamically expandable set of numbered files,
> so each participant gets to export file 0, file 1, file 2 and so on as
> required, in any order. I think this should suit both Parallel
> Tuplesort, which needs to export just one file from each participant,
> and Parallel Shared Hash, which doesn't know up front how many batches
> it will produce. Not quite ready, but I will post a version tomorrow
> to get Peter's reaction.

See 0007-hj-shared-buf-file-v6.patch in the v6 tarball in the parallel
shared hash thread.

--
Thomas Munro
http://www.enterprisedb.com
On Thu, Feb 16, 2017 at 8:45 AM, Peter Geoghegan <pg@bowt.ie> wrote:
>> I do not think there should be any reason why we can't get the
>> resource accounting exactly correct here. If a single backend manages
>> to remove every temporary file that it creates exactly once (and
>> that's currently true, modulo system crashes), a group of cooperating
>> backends ought to be able to manage to remove every temporary file
>> that any of them create exactly once (again, modulo system crashes).
>
> I believe that we are fully in agreement here. In particular, I think
> it's bad that there is an API that says "caller shouldn't throw an
> elog error between these two points", and that will be fixed before
> too long. I just think that it's worth acknowledging a certain nuance.

I attach my V9 of the patch. I came up with some stuff for the design
of resource management that I think meets every design goal that we
have for shared/unified BufFiles:

* Avoids both resource leaks, and spurious double-freeing of resources
(e.g., a second unlink() for a file from a different process) when
there are errors. The latter problem was possible before, a known
issue with V8 of the patch. I believe that this revision avoids these
problems in a way that is *absolutely bulletproof* in the face of
arbitrary failures (e.g., palloc() failure) in any process at any
time. Although, be warned that there is a remaining open item
concerning resource management in the leader-as-worker case, which I
go into below.

There are now what you might call "critical sections" in one function.
That is, there are points where we cannot throw an error (without a
BEGIN_CRIT_SECTION()!), but those are entirely confined to unification
code within the leader, where we can be completely sure that no error
can be raised. The leader can even fail before some but not all of a
particular worker's segments are in its local resource manager, and we
still do the right thing.

I've been testing this by adding code that randomly throws errors at
points interspersed throughout worker and leader unification hand-off
points. I then leave this stress-test build to run for a few hours,
while monitoring for leaked files and spurious fd.c reports of
double-unlink() and similar issues. Test builds change LOG to PANIC
within several places in fd.c, while MAX_PHYSICAL_FILESIZE was reduced
from 1GiB to BLCKSZ.

All of these guarantees are made without any special care from caller
to buffile.c. The only V9 change to tuplesort.c or logtape.c in this
general area is that they have to pass a dynamic shared memory segment
to buffile.c, so that it can register a new callback. That's it. This
may be of particular interest to Thomas. All complexity is confined to
buffile.c.

* No expansion in the use of shared memory to manage resources.
BufFile refcount is still per-worker. The role of local resource
managers is unchanged.

* Additional complexity over and above ordinary BufFile resource
management is confined to the leader process and its on_dsm_detach()
callback. Only the leader registers a callback. Of course, refcount
management within BufFileClose() can still take place in workers, but
that isn't something that we rely on (that's only for non-error
paths). In general, worker processes mostly have resource managers
managing their temp file segments as a thing that has nothing to do
with BufFiles (BufFiles are still not owned by resowner.c/fd.c --
they're blissfully unaware of all of this stuff).
* In general, unified BufFiles can still be treated in exactly the
same way as conventional BufFiles, and things just work, without any
special cases being exercised internally.

There is still an open item here, though: The leader-as-worker
Tuplesortstate, a special case, can still leak files. So,
stress-testing will only show the patch to be completely robust
against resource leaks when nbtsort.c is modified to enable
FORCE_SINGLE_WORKER testing. Despite the name FORCE_SINGLE_WORKER, you
can also modify that file to force there to be arbitrarily many
workers requested (just change "requested = 1" to something else). The
leader-as-worker problem is avoided because we don't have the leader
participating as a worker this way, which would otherwise present
issues for resowner.c that I haven't got around to fixing just yet. It
isn't hard to imagine why this is -- one backend with two FDs for
certain fd.c temp segments is just going to cause problems for
resowner.c without additional special care. Didn't seem worth blocking
on that. I want to prove that my general approach is workable. That
problem is confined to one backend's resource manager when it is the
leader participating as a worker. It is not a refcount problem.

The simplest solution here would be to ban the leader-as-worker case
by contract. Alternatively, we could pass fd.c segments from the
leader-as-worker Tuplesortstate's BufFile to the leader
Tuplesortstate's BufFile without opening or closing anything. This
way, there will be no second vFD entry for any segment at any time.

I've also made several changes to the cost model, changes agreed to
over on the "Cost model for parallel CREATE INDEX" thread. No need for
a recap on what those changes are here. In short, things have been
*significantly* simplified in that area.

Finally, note that I decided to throw out more code within
tuplesort.c. Now, a parallel leader is a thing that is explicitly set
up to be exactly consistent with a conventional/serial external sort
whose merge is about to begin. In particular, it now uses mergeruns().

Robert said that he thinks that this is a patch that is to some degree
a parallelism patch, and to some degree about sorting. I'd say that by
now, it's roughly 5% about sorting, in terms of the proportion of code
that expressly considers sorting. Most of the new stuff in tuplesort.c
is about managing dependencies between participating backends. I've
really focused on avoiding new special cases, especially with V9.

--
Peter Geoghegan
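To make the callback mechanism concrete, here is a hedged sketch of
how a leader-only on_dsm_detach() callback can be wired up; the
UnifiedBufFileState type and the two unified_* names are hypothetical
stand-ins, not the patch itself. on_dsm_detach() is the existing
dsm.c API, and it fires during both normal completion and error/abort
unwinding, which is what makes exactly-once leader cleanup possible.

#include "postgres.h"
#include "storage/buffile.h"
#include "storage/dsm.h"

/* Hypothetical per-sort state describing the unified BufFile. */
typedef struct UnifiedBufFileState UnifiedBufFileState;

extern void unified_buffile_cleanup(UnifiedBufFileState *state);

/*
 * Runs when the leader detaches from the DSM segment -- whether via
 * normal completion or error unwinding -- so the close of the unified
 * BufFile happens exactly once, in exactly one process.
 */
static void
leader_cleanup_callback(dsm_segment *seg, Datum arg)
{
    UnifiedBufFileState *state = (UnifiedBufFileState *) DatumGetPointer(arg);

    unified_buffile_cleanup(state);
}

/* Only the leader registers; workers keep relying on resowner.c. */
static void
register_leader_cleanup(dsm_segment *seg, UnifiedBufFileState *state)
{
    on_dsm_detach(seg, leader_cleanup_callback, PointerGetDatum(state));
}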
On Sun, Mar 12, 2017 at 3:05 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> There is still an open item here, though: The leader-as-worker
> Tuplesortstate, a special case, can still leak files.

I phrased this badly. What I mean is that there can be instances where
temp files are left on disk following a failure such as palloc() OOM;
no backend ends up doing an unlink() iff a leader-as-worker
Tuplesortstate was used and we get unlucky. I did not mean a leak of
virtual or real file descriptors, which would see Postgres print a
refcount leak warning from resowner.c.

Naturally, these "leaked" files will eventually be deleted by the next
restart of the server at the latest, within RemovePgTempFiles(). Note
also that a duplicate unlink() (with annoying LOG message) is
impossible under any circumstances with V9, regardless of whether or
not a leader-as-worker Tuplesortstate is involved.

Anyway, I was sure that I needed to completely nail this down in order
to be consistent with existing guarantees, but another look at
OpenTemporaryFile() makes me doubt that. ResourceOwnerEnlargeFiles()
is called, which itself uses palloc(), which can of course fail. There
are remarks over that function within resowner.c about OOM:

/*
 * Make sure there is room for at least one more entry in a ResourceOwner's
 * files reference array.
 *
 * This is separate from actually inserting an entry because if we run out
 * of memory, it's critical to do so *before* acquiring the resource.
 */
void
ResourceOwnerEnlargeFiles(ResourceOwner owner)
{
    ...
}

But this happens after OpenTemporaryFileInTablespace() has already
returned. Taking care to allocate memory up-front here is motivated by
keeping the vFD cache entry and current resource owner in perfect
agreement about the FD_XACT_TEMPORARY-ness of a file, and that's it.
It's *not* true that there is a broader sense in which
OpenTemporaryFile() is atomic, which for some reason I previously
believed to be the case. So, I haven't failed to prevent an outcome
that wasn't already possible.

It doesn't seem like it would be that hard to fix this, and then have
the parallel tuplesort patch live up to that new higher standard. But,
it's possible that Tom or maybe someone else would consider that a bad
idea, for roughly the same reason that we don't call
RemovePgTempFiles() for *crash* induced restarts, as mentioned by
Thomas up-thread:

 * NOTE: we could, but don't, call this during a post-backend-crash restart
 * cycle.  The argument for not doing it is that someone might want to
 * examine the temp files for debugging purposes.  This does however mean
 * that OpenTemporaryFile had better allow for collision with an existing
 * temp file name.
 */
void
RemovePgTempFiles(void)
{
    ...
}

Note that I did put some thought into making sure OpenTemporaryFile()
does the right thing with collisions with existing temp files. So,
maybe the right thing is to do nothing at all. I don't have strong
feelings either way on this question.

--
Peter Geoghegan
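The resowner.c idiom Peter is pointing at -- reserve tracking space
*before* acquiring the resource, so that an OOM can never strand an
acquired resource outside the owner's tracking -- can be paraphrased
in a few lines of fd.c-style code. This is a sketch of the sequence,
not an exact excerpt:

    File        file;

    /* may elog(ERROR) on palloc() failure -- nothing acquired yet */
    ResourceOwnerEnlargeFiles(CurrentResourceOwner);

    /* acquire the resource */
    file = OpenTemporaryFileInTablespace(tblspcOid, true);

    /* guaranteed not to fail, so the file cannot escape tracking */
    ResourceOwnerRememberFile(CurrentResourceOwner, file);

The point of the ordering is that the only failure-prone step happens
while there is still nothing to leak; once the temp file exists, the
bookkeeping step is infallible.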
On Sun, Mar 12, 2017 at 3:05 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> I attach my V9 of the patch. I came up with some stuff for the design
> of resource management that I think meets every design goal that we
> have for shared/unified BufFiles:

Commit 2609e91fc broke the parallel CREATE INDEX cost model. I should
now pass -1 as the index block argument to compute_parallel_worker(),
just as all callers that aren't parallel index scan do after that
commit. This issue caused V9 to never choose parallel CREATE INDEX
within nbtsort.c. There was also a small amount of bitrot.

Attached V10 fixes this regression. I also couldn't resist adding a
few new assertions that I thought were worth having to buffile.c, plus
dedicated wait events for parallel tuplesort. And, I fixed a silly bug
added in V9 around where worker_wait() should occur.

--
Peter Geoghegan
On Sun, Mar 19, 2017 at 9:03 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Sun, Mar 12, 2017 at 3:05 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>> I attach my V9 of the patch. I came up with some stuff for the design
>> of resource management that I think meets every design goal that we
>> have for shared/unified BufFiles:
>
> Commit 2609e91fc broke the parallel CREATE INDEX cost model. I should
> now pass -1 as the index block argument to compute_parallel_worker(),
> just as all callers that aren't parallel index scan do after that
> commit. This issue caused V9 to never choose parallel CREATE INDEX
> within nbtsort.c. There was also a small amount of bitrot.
>
> Attached V10 fixes this regression. I also couldn't resist adding a
> few new assertions that I thought were worth having to buffile.c, plus
> dedicated wait events for parallel tuplesort. And, I fixed a silly bug
> added in V9 around where worker_wait() should occur.

Some initial review comments:

- * This code is moderately slow (~10% slower) compared to the regular
- * btree (insertion) build code on sorted or well-clustered data. On
- * random data, however, the insertion build code is unusable -- the
- * difference on a 60MB heap is a factor of 15 because the random
- * probes into the btree thrash the buffer pool. (NOTE: the above
- * "10%" estimate is probably obsolete, since it refers to an old and
- * not very good external sort implementation that used to exist in
- * this module. tuplesort.c is almost certainly faster.)

While I agree that the old comment is probably inaccurate, I don't
think dropping it without comment in a patch to implement parallel
sorting is the way to go. How about updating it to be more current as
a separate patch?

+/* Magic numbers for parallel state sharing */
+#define PARALLEL_KEY_BTREE_SHARED UINT64CONST(0xA000000000000001)
+#define PARALLEL_KEY_TUPLESORT UINT64CONST(0xA000000000000002)
+#define PARALLEL_KEY_TUPLESORT_SPOOL2 UINT64CONST(0xA000000000000003)

1, 2, and 3 would probably work just as well. The parallel
infrastructure uses high-numbered values to avoid conflict with
plan_node_id values, but this is a utility statement so there's no
such problem. But it doesn't matter very much.

+ * Note: caller had better already hold some type of lock on the table and
+ * index.
+ */
+int
+plan_create_index_workers(Oid tableOid, Oid indexOid)

Caller should pass down the Relation rather than the Oid. That is
better both because it avoids unnecessary work and because it more or
less automatically avoids the problem mentioned in the note. Why is
this being put in planner.c rather than something specific to creating
indexes? Not sure that's a good idea.

+ * This should be called when workers have flushed out temp file buffers and
+ * yielded control to caller's process. Workers should hold open their
+ * BufFiles at least until the caller's process is able to call here and
+ * assume ownership of BufFile. The general pattern is that workers make
+ * available data from their temp files to one nominated process; there is
+ * no support for workers that want to read back data from their original
+ * BufFiles following writes performed by the caller, or any other
+ * synchronization beyond what is implied by caller contract. All
+ * communication occurs in one direction. All output is made available to
+ * caller's process exactly once by workers, following call made here at the
+ * tail end of processing.
Thomas has designed a system for sharing files among cooperating
processes that lacks several of these restrictions. With his system,
it's still necessary for all data to be written and flushed by the
writer before anybody tries to read it. But the restriction that the
worker has to hold its BufFile open until the leader can assume
ownership goes away. That's a good thing; it avoids the need for
workers to sit around waiting for the leader to assume ownership of a
resource instead of going away faster and freeing up worker slots for
some other query, or moving on to some other computation. The
restriction that the worker can't reread the data after handing off
the file also goes away. The files can be read and written by any
participant in any order, as many times as you like, with only the
restriction that the caller must guarantee that data will be written
and flushed from private buffers before it can be read.

I don't see any reason to commit both his system and your system, and
his is more general, so I think you should use it. That would cut
hundreds of lines from this patch with no real disadvantage that I can
see -- including things like worker_wait(), which are only needed
because of the shortcomings of the underlying mechanism.

+ * run. Parallel workers always use quicksort, however.

Comment fails to mention a reason.

+ elog(LOG, "%d using " INT64_FORMAT " KB of memory for read buffers among %d input tapes",
+ state->worker, state->availMem / 1024, numInputTapes);

I think "worker %d" or "participant %d" would be a lot better than
just starting the message with "%d". (There are multiple instances of
this, with various messages.)

I think some of the smaller changes that this patch makes, like
extending the parallel context machinery to support SnapshotAny, could
be usefully broken out as separately-committable patches.

I haven't really dug down into the details here, but with the
exception of the buffile.c stuff, which I don't like, the overall
design of this seems pretty sensible to me. We might eventually want
to do something more clever at the sorting level, but those changes
would be confined to tuplesort.c, and all the other changes you've
introduced here would stand on their own. Which is to say that even if
there's more win to be had here, this is a good start.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
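The write-flush-then-read-anywhere contract Robert describes can be
sketched as follows; none of the Shared* names are real -- they stand
in for whatever the shared BufFile mechanism ends up exporting, and
the two functions are a hypothetical usage pattern, not code from
either patch series.

#include "storage/buffile.h"

typedef struct SharedBufFileSet SharedBufFileSet;   /* hypothetical */

extern BufFile *SharedBufFileCreate(SharedBufFileSet *set,
                                    int participant, int file_no);
extern void SharedBufFileExport(BufFile *file);
extern BufFile *SharedBufFileImport(SharedBufFileSet *set,
                                    int participant, int file_no);

/* In participant A: write, flush, publish -- then A may simply exit. */
static void
producer(SharedBufFileSet *set, int a, char *data, size_t len)
{
    BufFile    *out = SharedBufFileCreate(set, a, 0);

    BufFileWrite(out, data, len);
    SharedBufFileExport(out);   /* flush private buffers; publish */
}

/* In any participant, any order, any number of times. */
static void
consumer(SharedBufFileSet *set, int a, char *buf, size_t len)
{
    BufFile    *in = SharedBufFileImport(set, a, 0);

    BufFileRead(in, buf, len);
    BufFileClose(in);           /* no unlink: last detacher cleans up */
}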
On Tue, Mar 21, 2017 at 9:10 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> - * This code is moderately slow (~10% slower) compared to the regular
> - * btree (insertion) build code on sorted or well-clustered data. On
> - * random data, however, the insertion build code is unusable -- the
> - * difference on a 60MB heap is a factor of 15 because the random
> - * probes into the btree thrash the buffer pool. (NOTE: the above
> - * "10%" estimate is probably obsolete, since it refers to an old and
> - * not very good external sort implementation that used to exist in
> - * this module. tuplesort.c is almost certainly faster.)
>
> While I agree that the old comment is probably inaccurate, I don't
> think dropping it without comment in a patch to implement parallel
> sorting is the way to go. How about updating it to be more current as
> a separate patch?

I think that since the comment refers to code from before 1999, it can
go. Any separate patch to remove it would have an entirely negative
linediff.

> +/* Magic numbers for parallel state sharing */
> 1, 2, and 3 would probably work just as well.

Okay.

> Why is this being put in planner.c rather than something specific to
> creating indexes? Not sure that's a good idea.

The idea is that it's the planner's domain, but this is a utility
statement, so it makes sense to put it next to the CLUSTER function
that determines whether CLUSTER sorts rather than does an index scan.
I don't have strong feelings on how appropriate that is.

> + * This should be called when workers have flushed out temp file buffers and
> + * yielded control to caller's process. Workers should hold open their
> + * BufFiles at least until the caller's process is able to call here and
> + * assume ownership of BufFile. The general pattern is that workers make
> + * available data from their temp files to one nominated process; there is
> + * no support for workers that want to read back data from their original
> + * BufFiles following writes performed by the caller, or any other
> + * synchronization beyond what is implied by caller contract. All
> + * communication occurs in one direction. All output is made available to
> + * caller's process exactly once by workers, following call made here at the
> + * tail end of processing.
>
> Thomas has designed a system for sharing files among cooperating
> processes that lacks several of these restrictions. With his system,
> it's still necessary for all data to be written and flushed by the
> writer before anybody tries to read it. But the restriction that the
> worker has to hold its BufFile open until the leader can assume
> ownership goes away. That's a good thing; it avoids the need for
> workers to sit around waiting for the leader to assume ownership of a
> resource instead of going away faster and freeing up worker slots for
> some other query, or moving on to some other computation. The
> restriction that the worker can't reread the data after handing off
> the file also goes away.

There is no restriction about workers not being able to reread data.
That comment makes it clear that that's only when the leader writes to
the file. It alludes to rereading within a worker following the leader
writing to their files in order to recycle blocks within logtape.c,
which the patch never has to do, unless you enable one of the 0002-*
testing GUCs to force randomAccess.
Obviously, if you write to the file in the leader, there is little
that the worker can do afterwards, but it's not a given that you'd
want to do that, and this patch actually never does. You could equally
well say that PHJ fails to provide for my requirement for having the
leader write to the files sensibly in order to recycle blocks, a
requirement that its shared BufFile mechanism expressly does not
support.

> That would cut hundreds of
> lines from this patch with no real disadvantage that I can see --
> including things like worker_wait(), which are only needed because of
> the shortcomings of the underlying mechanism.

I think it would definitely be a significant net gain in LOC. And,
worker_wait() will probably be replaced by the use of the barrier
abstraction anyway. It didn't seem worth creating a dependency on it
early, given my simple requirements. PHJ uses barriers instead,
presumably because there is much more of this stuff. The workers
generally won't have to wait at all. It's expected to be pretty much
instantaneous.

> + * run. Parallel workers always use quicksort, however.
>
> Comment fails to mention a reason.

Well, I don't think that there is any reason to use replacement
selection at all, what with the additional merge heap work last year.
But, the theory there remains that RS is good when you can get one big
run and no merge. You're not going to get that with parallel sort in
any case, since the leader must merge. Besides, merging in the workers
happens in the workers. And, the backspace requirement of 32MB of
workMem per participant pretty much eliminates any use of RS that
you'd get otherwise.

> I think "worker %d" or "participant %d" would be a lot better than
> just starting the message with "%d". (There are multiple instances of
> this, with various messages.)

Okay.

> I think some of the smaller changes that this patch makes, like
> extending the parallel context machinery to support SnapshotAny, could
> be usefully broken out as separately-committable patches.

Okay.

> I haven't really dug down into the details here, but with the
> exception of the buffile.c stuff, which I don't like, the overall
> design of this seems pretty sensible to me. We might eventually want
> to do something more clever at the sorting level, but those changes
> would be confined to tuplesort.c, and all the other changes you've
> introduced here would stand on their own. Which is to say that even
> if there's more win to be had here, this is a good start.

That's certainly how I feel about it.

I believe that the main reason that you like the design I came up with
on the whole is that it's minimally divergent from the serial case.
The changes in logtape.c and tuplesort.c are actually very minor. But,
the reason that that's possible at all is because buffile.c adds some
complexity that is all about maintaining existing assumptions. You
don't like that complexity. I would suggest that it's useful that I've
been able to isolate it to buffile.c fairly well.

A quick tally of the existing assumptions this patch preserves:

1. Resource managers still work as before. This means that error
handling will work the same way as before. We cooperate with that
mechanism, rather than supplanting it entirely.

2. There is only one BufFile per logical tapeset per tuplesort, in
both workers and the leader.

3. You can write to the end of a unified BufFile in leader to have it
extended, while resource managers continue to do the right thing
despite differing requirements for each segment.
This leaves things sane for workers to read, provided the leader keeps
to its own space in the unified BufFile.

4. Temp files must go away at EoX, no matter what.

Thomas has created a kind of shadow resource manager in shared memory.
So, he isn't using fd.c resource management stuff. He is concerned
with a set of BufFiles, each of which has specific significance to
each parallel hash join (they're per worker HJ batch). PHJ has an
unpredictable number of BufFiles, while parallel tuplesort always has
one, just as before. For the most part, I think that what Thomas has
done reflects his own requirements, just as what I've done reflects my
requirements. There seems to be no excellent opportunity to use a
common infrastructure.

I think that not cooperating with the existing mechanism will prove to
be buggy. Following a quick look at the latest PHJ patch series, and
its 0008-hj-shared-buf-file-v8.patch file, I already see one example.
I notice that there could be multiple calls to
pgstat_report_tempfile() within each backend for the same BufFile
segment. Isn't that counting the same thing more than once? In
general, it seems problematic that there are now "true" fd.c temp
segments, as well as shared BufFile temp segments that are never in a
backend resource manager.

--
Peter Geoghegan
On Tue, Mar 21, 2017 at 2:03 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> I think that since the comment refers to code from before 1999, it can
> go. Any separate patch to remove it would have an entirely negative
> linediff.

It's a good general principle that a patch should do one thing well
and not make unrelated changes. I try hard to adhere to that principle
in my commits, and I think other committers generally do (and should),
too. Of course, different people draw the line in different places. If
you can convince another committer to include that change in their
commit of this patch, well, that's not my cup of tea, but so be it. If
you want me to consider committing this, you're going to have to
submit that part separately, preferably on a separate thread with a
suitably descriptive subject line.

> Obviously, if you write to the file in the leader, there is little
> that the worker can do afterwards, but it's not a given that you'd
> want to do that, and this patch actually never does. You could equally
> well say that PHJ fails to provide for my requirement for having the
> leader write to the files sensibly in order to recycle blocks, a
> requirement that its shared BufFile mechanism expressly does not
> support.

From my point of view, the main point is that having two completely
separate mechanisms for managing temporary files that need to be
shared across cooperating workers is not a good decision. That's a
need that's going to come up over and over again, and it's not
reasonable for everybody who needs it to add a separate mechanism for
doing it. We need to have ONE mechanism for it.

The second point is that I'm pretty convinced that the design you've
chosen is fundamentally wrong. I've attempted to explain that multiple
times, starting about three months ago with
http://postgr.es/m/CA+TgmoYP0vzPw64DfMQT1JHY6SzyAvjogLkj3erMZzzN2f9xLA@mail.gmail.com
and continuing across many subsequent emails on multiple threads. It's
just not OK in my book for a worker to create something that it
initially owns and then later transfer it to the leader. The
cooperating backends should have joint ownership of the objects from
the beginning, and the last process to exit the set should clean up
those resources.

>> That would cut hundreds of
>> lines from this patch with no real disadvantage that I can see --
>> including things like worker_wait(), which are only needed because of
>> the shortcomings of the underlying mechanism.
>
> I think it would definitely be a significant net gain in LOC. And,
> worker_wait() will probably be replaced by the use of the barrier
> abstraction anyway.

No, because if you do it Thomas's way, the worker can exit right away,
without waiting. You don't have to wait via a different method; you
escape waiting altogether. I understand that your point is that the
wait will always be brief, but I think that's probably an optimistic
assumption and definitely an unnecessary assumption. It's optimistic
because there is absolutely no guarantee that all workers will take
the same amount of time to sort the data they read. It is absolutely
not the case that all data sets sort at the same speed. Because of the
way parallel sequential scan works, we're somewhat insulated from
that; workers that sort faster will get a larger chunk of the table.
However, that only means that workers will finish generating their
sorted runs at about the same time, not that they will finish merging
at the same time.
And, indeed, if some workers end up with more data than others (so
that they finish building runs at about the same time) then some will
probably take longer to complete the merging than others.

But even if it were true that the waits will always be brief, I still
think the way you've done it is a bad idea, because now tuplesort.c
has to know that it needs to wait because of some detail of
lower-level resource management about which it should not have to
care. That alone is a sufficient reason to want a better approach. I
completely accept that whatever abstraction we use at the BufFile
level has to be something that can be plumbed into logtape.c, and if
Thomas's mechanism can't be bolted in there in a sensible way then
that's a problem. But I feel quite strongly that the solution to that
problem isn't to adopt the approach you've taken here.

>> + * run. Parallel workers always use quicksort, however.
>>
>> Comment fails to mention a reason.
>
> Well, I don't think that there is any reason to use replacement
> selection at all, what with the additional merge heap work last year.
> But, the theory there remains that RS is good when you can get one big
> run and no merge. You're not going to get that with parallel sort in
> any case, since the leader must merge. Besides, merging in the workers
> happens in the workers. And, the backspace requirement of 32MB of
> workMem per participant pretty much eliminates any use of RS that
> you'd get otherwise.

So, please mention that briefly in the comment.

> I believe that the main reason that you like the design I came up with
> on the whole is that it's minimally divergent from the serial case.

That's part of it, I guess, but it's more that the code you've added
to do parallelism here looks an awful lot like what's gotten added to
do parallelism in other cases, like parallel query. That's probably a
good sign.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Mar 21, 2017 at 12:06 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> From my point of view, the main point is that having two completely
> separate mechanisms for managing temporary files that need to be
> shared across cooperating workers is not a good decision. That's a
> need that's going to come up over and over again, and it's not
> reasonable for everybody who needs it to add a separate mechanism for
> doing it. We need to have ONE mechanism for it.

Obviously I understand that there is value in code reuse in general.
The exact extent to which code reuse is possible here has been unclear
throughout, because it's complicated for all kinds of reasons. That's
why Thomas and I had 2 multi-hour Skype calls all about it.

> It's just not OK in my book for a worker to create something that it
> initially owns and then later transfer it to the leader.

Isn't that an essential part of having a refcount, in general? You
were the one that suggested refcounting.

> The cooperating backends should have joint ownership of the objects from
> the beginning, and the last process to exit the set should clean up
> those resources.

That seems like a facile summary of the situation. There is a sense in
which there is always joint ownership of files with my design. But
there is also a sense in which there isn't, because it's impossible to
do that while not completely reinventing resource management of temp
files. I wanted to preserve resowner.c ownership of fd.c segments.

You maintain that it's better to have the leader unlink() everything
at the end, and suppress the errors when that doesn't work, so that
that path always just plows through. I disagree with that. It is a
trade-off, I suppose. I have now run out of time to work through it
with you or Thomas, though.

> But even if it were true that the waits will always be brief, I still
> think the way you've done it is a bad idea, because now tuplesort.c
> has to know that it needs to wait because of some detail of
> lower-level resource management about which it should not have to
> care. That alone is a sufficient reason to want a better approach.

There is already a point at which the leader needs to wait, so that it
can accumulate stats that nbtsort.c cares about. So we already need a
leader wait point within nbtsort.c (that one is called directly by
nbtsort.c). Doesn't seem like too bad of a wart to have the same thing
for workers.

>> I believe that the main reason that you like the design I came up with
>> on the whole is that it's minimally divergent from the serial case.
>
> That's part of it, I guess, but it's more that the code you've added
> to do parallelism here looks an awful lot like what's gotten added to
> do parallelism in other cases, like parallel query. That's probably a
> good sign.

It's also a good sign that it makes CREATE INDEX approximately 3 times
faster.

--
Peter Geoghegan
On Tue, Mar 21, 2017 at 3:50 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Tue, Mar 21, 2017 at 12:06 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> From my point of view, the main point is that having two completely
>> separate mechanisms for managing temporary files that need to be
>> shared across cooperating workers is not a good decision. That's a
>> need that's going to come up over and over again, and it's not
>> reasonable for everybody who needs it to add a separate mechanism for
>> doing it. We need to have ONE mechanism for it.
>
> Obviously I understand that there is value in code reuse in general.
> The exact extent to which code reuse is possible here has been unclear
> throughout, because it's complicated for all kinds of reasons. That's
> why Thomas and I had 2 multi-hour Skype calls all about it.

I agree that the extent to which code reuse is possible here is
somewhat unclear, but I am 100% confident that the answer is non-zero.
You and Thomas both need BufFiles that can be shared across multiple
backends associated with the same ParallelContext. I don't understand
how you can argue that it's reasonable to have two different ways of
sharing the same kind of object across the same set of processes. And
if that's not reasonable, then somehow we need to come up with a
single mechanism that can meet both your requirements and Thomas's
requirements.

>> It's just not OK in my book for a worker to create something that it
>> initially owns and then later transfer it to the leader.
>
> Isn't that an essential part of having a refcount, in general? You
> were the one that suggested refcounting.

No, quite the opposite. My point in suggesting adding a refcount was
to avoid needing to have a single owner. Instead, the process that
decrements the reference count to zero becomes responsible for doing
the cleanup. What you've done with the ref count is use it as some
kind of medium for transferring responsibility from backend A to
backend B; what I want is to allow backends A, B, C, D, E, and F to
attach to the same shared resource, and whichever one of them happens
to be the last one out of the room shuts off the lights.

>> The cooperating backends should have joint ownership of the objects from
>> the beginning, and the last process to exit the set should clean up
>> those resources.
>
> That seems like a facile summary of the situation. There is a sense in
> which there is always joint ownership of files with my design. But
> there is also a sense in which there isn't, because it's impossible to
> do that while not completely reinventing resource management of temp
> files. I wanted to preserve resowner.c ownership of fd.c segments.

As I've said before, I think that's an anti-goal. This is a different
problem, and trying to reuse the solution we chose for the
non-parallel case doesn't really work. resowner.c could end up owning
a shared reference count which it's responsible for decrementing --
and then decrementing it removes the file if the result is zero. But
it can't own performing the actual unlink(), because then we can't
support cases where the file may have multiple readers, since whoever
owns the unlink() might try to zap the file out from under one of the
others.

> You maintain that it's better to have the leader unlink() everything
> at the end, and suppress the errors when that doesn't work, so that
> that path always just plows through.

I don't want the leader to be responsible for anything.
I want the last process to detach to be responsible for cleanup,
regardless of which process that ends up being. I want that for lots
of good reasons which I have articulated, including (1) it's how all
other resource management for parallel query already works, e.g. DSM,
DSA, and group locking; (2) it avoids the need for one process to sit
and wait until another process assumes ownership, which isn't a
feature even if (as you contend, and I'm not convinced) it doesn't
hurt much; and (3) it allows for use cases where multiple processes
are reading from the same shared BufFile without the risk that some
other process will try to unlink() the file while it's still in use.
The point for me isn't so much whether unlink() ever ignores errors as
whether cleanup (however defined) is an operation guaranteed to happen
exactly once.

> I disagree with that. It is a
> trade-off, I suppose. I have now run out of time to work through it
> with you or Thomas, though.

Bummer.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
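A minimal sketch of the "last one out turns off the lights" discipline
Robert describes, using a refcount in shared memory; all names here
are hypothetical stand-ins, and the cleanup callback is assumed to be
registered via on_dsm_detach() by every participant, not just the
leader.

#include "postgres.h"
#include "storage/dsm.h"
#include "storage/spin.h"

typedef struct SharedFileSetState
{
    slock_t     mutex;
    int         refcount;       /* one per attached participant */
} SharedFileSetState;

extern void delete_all_files(SharedFileSetState *state);

/*
 * Whichever process decrements the count to zero -- worker or leader,
 * in any order of exit, on success or during error unwinding --
 * performs the cleanup, and exactly one process does.
 */
static void
shared_files_on_detach(dsm_segment *seg, Datum arg)
{
    SharedFileSetState *state = (SharedFileSetState *) DatumGetPointer(arg);
    bool        last;

    SpinLockAcquire(&state->mutex);
    last = (--state->refcount == 0);
    SpinLockRelease(&state->mutex);

    if (last)
        delete_all_files(state);    /* idempotent; may ignore ENOENT */
}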
On Wed, Mar 22, 2017 at 10:03 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Mar 21, 2017 at 3:50 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>> I disagree with that. It is a
>> trade-off, I suppose. I have now run out of time to work through it
>> with you or Thomas, though.
>
> Bummer.

I'm going to experiment with refactoring the v10 parallel CREATE INDEX
patch to use the SharedBufFileSet interface from
hj-shared-buf-file-v8.patch today and see what problems I run into.

--
Thomas Munro
http://www.enterprisedb.com
On Tue, Mar 21, 2017 at 2:03 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I agree that the extent to which code reuse is possible here is
> somewhat unclear, but I am 100% confident that the answer is non-zero.
> You and Thomas both need BufFiles that can be shared across multiple
> backends associated with the same ParallelContext. I don't understand
> how you can argue that it's reasonable to have two different ways of
> sharing the same kind of object across the same set of processes.

I didn't argue that. Rather, I argued that there are going to be
significant additional requirements for PHJ, because it has to support
arbitrarily many BufFiles, rather than either 1 or 2 (one per
tuplesort/logtapeset). Just how "significant" that would be I cannot
say, regrettably. (Or, we're going to have to make logtape.c multiplex
BufFiles, which risks breaking other logtape.c routines that aren't
even used just yet.)

>> Isn't that an essential part of having a refcount, in general? You
>> were the one that suggested refcounting.
>
> No, quite the opposite. My point in suggesting adding a refcount was
> to avoid needing to have a single owner. Instead, the process that
> decrements the reference count to zero becomes responsible for doing
> the cleanup. What you've done with the ref count is use it as some
> kind of medium for transferring responsibility from backend A to
> backend B; what I want is to allow backends A, B, C, D, E, and F to
> attach to the same shared resource, and whichever one of them happens
> to be the last one out of the room shuts off the lights.

Actually, that's quite possible with the design I came up with. The
restriction that Thomas can't live with as I've left things is that
you have to know the number of BufFiles ahead of time. I'm pretty sure
that that's all it is. (I do sympathize with the fact that that isn't
very helpful to him, though.)

> As I've said before, I think that's an anti-goal. This is a different
> problem, and trying to reuse the solution we chose for the
> non-parallel case doesn't really work. resowner.c could end up owning
> a shared reference count which it's responsible for decrementing --
> and then decrementing it removes the file if the result is zero. But
> it can't own performing the actual unlink(), because then we can't
> support cases where the file may have multiple readers, since whoever
> owns the unlink() might try to zap the file out from under one of the
> others.

Define "zap the file". I think, based on your remarks here, that
you've misunderstood my design. I think you should at least understand
it fully if you're going to dismiss it.

It is true that a worker resowner can unlink() the files
mid-unification, in the same manner as with conventional temp files,
and not decrement its refcount in shared memory, or care at all in any
special way. This is okay because the leader (in the case of parallel
tuplesort) will realize that it should not "turn out the lights",
finding that remaining reference when it calls BufFileClose() in its
registered callback, as it alone must. It doesn't matter that the
unlink() may have already occurred, or may be just about to occur,
because we are only operating on already-opened files, and never on
the link itself (we don't have to stat() the file link, for example,
which is naturally only a task for the unlink()'ing backend anyway).
You might say that the worker only blows away the link itself, not the
file proper, since it may still be open in the leader (say).

** We rely on the fact that files are themselves a kind of reference
counted thing, in general; they have an independent existence from the
link originally used to open() them. **

The reason that there is a brief wait in workers for parallel
tuplesort is because it gives us the opportunity to have the
immediately subsequent worker BufFileClose() not turn out the lights
in the worker, because the leader must have a reference on the BufFile
when workers are released. So, there is a kind of interlock that makes
sure that there is always at least 1 owner.

There would be no need for an additional wait but for the fact that
the leader wants to unify multiple worker BufFiles as one, and must
open them all at once for the sake of simplicity. But that's just how
parallel tuplesort in particular happens to work, since it has only
one BufFile in the leader, which it wants to operate on with
everything set up up-front.

Thomas' design cannot reliably know how many segments there are in
workers in error paths, which necessitates his unlink()-ENOENT-ignore
hack. My solution is that workers/owners look after their own temp
segments in the conventional way, until they reach BufFileClose(),
which may never come if there is an error. The only way that clean-up
won't happen in conventional resowner.c-in-worker fashion is if
BufFileClose() is reached in the owner/worker. BufFileClose() must be
reached when there is no error, which has to happen anyway when using
temp files. (Else there is a temp file leak warning from resowner.c.)

This is the only way to avoid the unlink()-ENOENT-ignore hack, AFAICT,
since only the worker itself can reliably know how many segments it
has opened at every single instant in time. Because it's the owner!

>> You maintain that it's better to have the leader unlink() everything
>> at the end, and suppress the errors when that doesn't work, so that
>> that path always just plows through.
>
> I don't want the leader to be responsible for anything.

I meant in the case of parallel CREATE INDEX specifically, were it to
use this other mechanism. Substitute "leader" with "the last backend"
in reading my remarks here.

> I want the
> last process to detach to be responsible for cleanup, regardless of
> which process that ends up being. I want that for lots of good
> reasons which I have articulated, including (1) it's how all other
> resource management for parallel query already works, e.g. DSM, DSA,
> and group locking; (2) it avoids the need for one process to sit and
> wait until another process assumes ownership, which isn't a feature
> even if (as you contend, and I'm not convinced) it doesn't hurt much;
> and (3) it allows for use cases where multiple processes are reading
> from the same shared BufFile without the risk that some other process
> will try to unlink() the file while it's still in use. The point for
> me isn't so much whether unlink() ever ignores errors as whether
> cleanup (however defined) is an operation guaranteed to happen exactly
> once.

My patch demonstrably has these properties. I've done quite a bit of
fault injection testing to prove it. (Granted, I need to take extra
steps for the leader-as-worker backend, a special case, which I
haven't done already because I was waiting on your feedback on the
appropriate trade-off there.)

--
Peter Geoghegan
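The interlock, as described, can be sketched in a few lines; every
name here is a hypothetical stand-in, and the real patch differs in
detail. The worker publishes its tape, waits until the leader has
taken its own reference, and only then closes -- so the refcount never
reaches zero while the file is still needed.

/* Worker, at the end of its sort (hypothetical names): */
BufFileExportShared(worker_buffile);  /* flush; make visible to leader */
worker_wait(shared_state);            /* until leader holds a reference */
BufFileClose(worker_buffile);         /* refcount > 1: lights stay on */

/* Leader, meanwhile: */
for (int i = 0; i < nworkers; i++)
    unified_open_worker_tape(leader_buffile, i);  /* takes references */
release_workers(shared_state);        /* workers may now close and exit */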
On Tue, Mar 21, 2017 at 2:49 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> I'm going to experiment with refactoring the v10 parallel CREATE INDEX
> patch to use the SharedBufFileSet interface from
> hj-shared-buf-file-v8.patch today and see what problems I run into.

I would be happy if you took over parallel CREATE INDEX completely. It
makes a certain amount of sense, and not just because I am no longer
able to work on it. You're the one doing things with shared BufFiles
that are of significant complexity. Certainly more complicated than
what parallel CREATE INDEX needs in every way, and necessarily so. I
will still have some more feedback on your shared BufFile design,
though, while it's fresh in my mind.

--
Peter Geoghegan
On Tue, Mar 21, 2017 at 7:37 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>>> Isn't that an essential part of having a refcount, in general? You
>>> were the one that suggested refcounting.
>>
>> No, quite the opposite. My point in suggesting adding a refcount was
>> to avoid needing to have a single owner. Instead, the process that
>> decrements the reference count to zero becomes responsible for doing
>> the cleanup. What you've done with the ref count is use it as some
>> kind of medium for transferring responsibility from backend A to
>> backend B; what I want is to allow backends A, B, C, D, E, and F to
>> attach to the same shared resource, and whichever one of them happens
>> to be the last one out of the room shuts off the lights.
>
> Actually, that's quite possible with the design I came up with.

I don't think it is. What sequence of calls to the APIs you've
proposed would accomplish that goal? I don't see anything in this
patch set that would permit anything other than a handoff from the
worker to the leader. There seems to be no way for the ref count to be
more than 1 (or 2?).

> The
> restriction that Thomas can't live with as I've left things is that
> you have to know the number of BufFiles ahead of time. I'm pretty sure
> that that's all it is. (I do sympathize with the fact that that isn't
> very helpful to him, though.)

I feel like there's some cognitive dissonance here. On the one hand,
you're saying we should use your design. On the other hand, you are
admitting that in at least one key respect, it won't meet Thomas's
requirements. On the third hand, you just said that you weren't
arguing for two mechanisms for sharing a BufFile across cooperating
parallel processes. I don't see how you can hold all three of those
positions simultaneously.

>> As I've said before, I think that's an anti-goal. This is a different
>> problem, and trying to reuse the solution we chose for the
>> non-parallel case doesn't really work. resowner.c could end up owning
>> a shared reference count which it's responsible for decrementing --
>> and then decrementing it removes the file if the result is zero. But
>> it can't own performing the actual unlink(), because then we can't
>> support cases where the file may have multiple readers, since whoever
>> owns the unlink() might try to zap the file out from under one of the
>> others.
>
> Define "zap the file". I think, based on your remarks here, that
> you've misunderstood my design. I think you should at least understand
> it fully if you're going to dismiss it.

zap was a colloquialism for unlink(). I concede that I don't fully
understand your design, and am trying to understand those things I do
not yet understand.

> It is true that a worker resowner can unlink() the files
> mid-unification, in the same manner as with conventional temp files,
> and not decrement its refcount in shared memory, or care at all in any
> special way. This is okay because the leader (in the case of parallel
> tuplesort) will realize that it should not "turn out the lights",
> finding that remaining reference when it calls BufFileClose() in its
> registered callback, as it alone must. It doesn't matter that the
> unlink() may have already occurred, or may be just about to occur,
> because we are only operating on already-opened files, and never on
> the link itself (we don't have to stat() the file link, for example,
> which is naturally only a task for the unlink()'ing backend anyway).
> You might say that the worker only blows away the link itself, not the
> file proper, since it may still be open in the leader (say).

Well, that sounds like it's counting on fd.c not to close the file
descriptor at an inconvenient point in time and reopen it later, which
is not guaranteed.

> Thomas' design cannot reliably know how many segments there are in
> workers in error paths, which necessitates his unlink()-ENOENT-ignore
> hack. My solution is that workers/owners look after their own temp
> segments in the conventional way, until they reach BufFileClose(),
> which may never come if there is an error. The only way that clean-up
> won't happen in conventional resowner.c-in-worker fashion is if
> BufFileClose() is reached in the owner/worker. BufFileClose() must be
> reached when there is no error, which has to happen anyway when using
> temp files. (Else there is a temp file leak warning from resowner.c.)
>
> This is the only way to avoid the unlink()-ENOENT-ignore hack, AFAICT,
> since only the worker itself can reliably know how many segments it
> has opened at every single instant in time. Because it's the owner!

Above, you said that your design would allow for a group of processes
to share access to a file, with the last one that abandons it "turning
out the lights". But here, you are referring to it as having one owner
-- "only the worker itself" can know the number of segments. Those
things are exact opposites of each other.

I don't think there's any problem with ignoring ENOENT, and I don't
think there's any need for a process to know the exact number of
segments in some temporary file. In a shared-ownership environment,
that information can't be stored in a backend-private cache; it's got
to be available to whichever backend ends up being the last one out.
There are only two ways to do that. One is to store it in shared
memory, and the other is to discover it from the filesystem. The
former is conceptually more appealing, but it can't handle Thomas's
requirement of an unlimited number of files, so I think it makes sense
to go with the latter. The only problem with that which I can see is
that we might orphan some temporary files if the disk is flaky and
filesystem operations are failing intermittently, but that's already a
pretty bad situation which we're not going to make much worse with
this approach.

>> I want the
>> last process to detach to be responsible for cleanup, regardless of
>> which process that ends up being. I want that for lots of good
>> reasons which I have articulated, including (1) it's how all other
>> resource management for parallel query already works, e.g. DSM, DSA,
>> and group locking; (2) it avoids the need for one process to sit and
>> wait until another process assumes ownership, which isn't a feature
>> even if (as you contend, and I'm not convinced) it doesn't hurt much;
>> and (3) it allows for use cases where multiple processes are reading
>> from the same shared BufFile without the risk that some other process
>> will try to unlink() the file while it's still in use. The point for
>> me isn't so much whether unlink() ever ignores errors as whether
>> cleanup (however defined) is an operation guaranteed to happen exactly
>> once.
>
> My patch demonstrably has these properties. I've done quite a bit of
> fault injection testing to prove it.

I don't understand this comment, because 0 of the 3 properties that I
just articulated are things which can be proved or disproved by fault
injection.
Fault injection can confirm the presence of bugs or suggest their
absence, but none of those properties have to do with whether there
are bugs.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
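The filesystem-discovery approach Robert favors can be sketched in a
few lines of portable C (illustrative only; the file-naming scheme
here is made up): probe numbered segment files upward until unlink()
reports ENOENT, treating a missing file as "already cleaned up" rather
than as an error, which is what makes the cleanup idempotent and
therefore safe to run from whichever backend exits last.

#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Delete "base.0", "base.1", ... without knowing the segment count in
 * advance.  ENOENT simply terminates the scan: either we ran off the
 * end of the series, or another process already removed that segment.
 */
static void
delete_segment_files(const char *base)
{
    char        path[1024];

    for (int segno = 0;; segno++)
    {
        snprintf(path, sizeof(path), "%s.%d", base, segno);
        if (unlink(path) < 0)
        {
            if (errno != ENOENT)
                fprintf(stderr, "could not unlink \"%s\"\n", path);
            break;
        }
    }
}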
On Wed, Mar 22, 2017 at 5:44 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Actually, that's quite possible with the design I came up with.
>
> I don't think it is. What sequence of calls to the APIs you've
> proposed would accomplish that goal? I don't see anything in this
> patch set that would permit anything other than a handoff from the
> worker to the leader. There seems to be no way for the ref count to be
> more than 1 (or 2?).

See my remarks on this below.

>> The
>> restriction that Thomas can't live with as I've left things is that
>> you have to know the number of BufFiles ahead of time. I'm pretty sure
>> that that's all it is. (I do sympathize with the fact that that isn't
>> very helpful to him, though.)
>
> I feel like there's some cognitive dissonance here. On the one hand,
> you're saying we should use your design.

No, I'm not. I'm saying that my design is complete on its own terms,
and has some important properties that a mechanism like this ought to
have. I think I've been pretty clear on my general uncertainty about
the broader question.

> On the other hand, you are
> admitting that in at least one key respect, it won't meet Thomas's
> requirements. On the third hand, you just said that you weren't
> arguing for two mechanisms for sharing a BufFile across cooperating
> parallel processes. I don't see how you can hold all three of those
> positions simultaneously.

I respect your position as the person that completely owns parallelism
here. You are correct when you say that there has to be some overlap
between the requirements for the mechanisms used by each patch --
there just *has* to be. As I said, I only know very approximately how
much overlap that is or should be, even at this late date, and I am
unfortunately not in a position to spend more time on it to find out.
C'est la vie.

I know that I have no chance of convincing you to adopt my design
here, and you are right not to accept the design, because there is a
bigger picture. And, because it's just too late now. My efforts to get
ahead of that, and anticipate and provide for Thomas' requirements,
have failed. I admit that.

But, you are asserting that my patch has specific technical defects
that it does not have. I structured things this way for a reason. You
are not required to agree with me in full to see that I might have had
a point. I've described it as a trade-off already. I think that it
will be of practical value to you to see that trade-off. This insight
is what allowed me to immediately zero in on resource leak bugs in
Thomas' revision of the patch from yesterday.

>> It is true that a worker resowner can unlink() the files
>> mid-unification, in the same manner as with conventional temp files,
>> and not decrement its refcount in shared memory, or care at all in any
>> special way. This is okay because the leader (in the case of parallel
>> tuplesort) will realize that it should not "turn out the lights",
>> finding that remaining reference when it calls BufFileClose() in its
>> registered callback, as it alone must. It doesn't matter that the
>> unlink() may have already occurred, or may be just about to occur,
>> because we are only operating on already-opened files, and never on
>> the link itself (we don't have to stat() the file link, for example,
>> which is naturally only a task for the unlink()'ing backend anyway).
>> You might say that the worker only blows away the link itself, not the
>> file proper, since it may still be open in the leader (say).
>
> Well, that sounds like it's counting on fd.c not to close the file
> descriptor at an inconvenient point in time and reopen it later, which
> is not guaranteed.

It's true that in an error path, if the FD of the file we just opened gets swapped out, that could happen. That seems virtually impossible, and in any case the consequence is no worse than a confusing LOG message. But, yes, that's a weakness.

>> This is the only way to avoid the unlink()-ENOENT-ignore hack, AFAICT,
>> since only the worker itself can reliably know how many segments it
>> has opened at every single instant in time. Because it's the owner!
>
> Above, you said that your design would allow for a group of processes
> to share access to a file, with the last one that abandons it "turning
> out the lights". But here, you are referring to it as having one
> owner - the "only the worker itself" can know the number of segments.
> Those things are exact opposites of each other.

You misunderstood. Under your analogy, the worker needs to wait for someone else to enter the room before leaving, because otherwise, as an "environmentally conscious" worker, it would be compelled to turn the lights out before anyone else ever got to do anything with its files. But once someone else is in the room, the worker is free to leave without turning out the lights. I could provide a mechanism for the leader, or whatever the other backend is, to do another handoff. You're right that that is left unimplemented, but it would be a trivial adjunct to what I came up with.

> I don't think there's any problem with ignoring ENOENT, and I don't
> think there's any need for a process to know the exact number of
> segments in some temporary file.

You may well be right, but that is just one detail.

>> My patch demonstrably has these properties. I've done quite a bit of
>> fault injection testing to prove it.
>
> I don't understand this comment, because 0 of the 3 properties that I
> just articulated are things which can be proved or disproved by fault
> injection. Fault injection can confirm the presence of bugs or
> suggest their absence, but none of those properties have to do with
> whether there are bugs.

I was unclear -- I just meant (3). Specifically, that resource ownership has been shown to be robust under stress testing/fault injection testing.

Anyway, I will provide some feedback on Thomas' latest revision from today, before I bow out. I owe him at least that much.

-- Peter Geoghegan
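To make the reference-counted "turn out the lights" contract described above concrete, here is a minimal sketch of such a close path. Every name in it is a hypothetical illustration for this thread, not the patch's actual API:

    #include "postgres.h"
    #include "storage/buffile.h"
    #include "storage/spin.h"

    typedef struct SharedBufFileState
    {
        slock_t     mutex;
        int         refcount;       /* backends still attached to the file */
    } SharedBufFileState;

    static void UnlinkSharedSegments(SharedBufFileState *state);  /* hypothetical */

    static void
    SharedBufFileRelease(SharedBufFileState *state, BufFile *file)
    {
        bool        lastone;

        SpinLockAcquire(&state->mutex);
        lastone = (--state->refcount == 0);
        SpinLockRelease(&state->mutex);

        /* Close our own descriptors; this never touches the on-disk link. */
        BufFileClose(file);

        /*
         * Only the last backend to detach "turns out the lights" by unlinking
         * the underlying segment files; earlier leavers just walk out of the
         * room and leave the lights on.
         */
        if (lastone)
            UnlinkSharedSegments(state);
    }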
On 2017-02-10 07:52:57 -0500, Robert Haas wrote:
> On Thu, Feb 9, 2017 at 6:38 PM, Thomas Munro
> > Up until two minutes ago I assumed that policy would leave only two
> > possibilities: you attach to the DSM segment and attach to the
> > SharedBufFileManager successfully or you attach to the DSM segment and
> > then die horribly (but not throw) and the postmaster restarts the
> > whole cluster and blows all temp files away with RemovePgTempFiles().
> > But I see now in the comment of that function that crash-induced
> > restarts don't call that because "someone might want to examine the
> > temp files for debugging purposes". Given that policy for regular
> > private BufFiles, I don't see why that shouldn't apply equally to
> > shared files: after a crash restart, you may have some junk files that
> > won't be cleaned up until your next clean restart, whether they were
> > private or shared BufFiles.
>
> I think most people (other than Tom) would agree that that policy
> isn't really sensible any more; it probably made sense when the
> PostgreSQL user community was much smaller and consisted mostly of the
> people developing PostgreSQL, but these days it's much more likely to
> cause operational headaches than to help a developer debug.

FWIW, we have restart_after_crash = false. If you need to debug things, you can enable that. Hence the whole RemovePgTempFiles() crash-restart exemption isn't required anymore; we have a much more targeted solution.

- Andres
On Wed, Mar 22, 2017 at 3:19 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
As per the earlier discussion in the thread, I experimented with using the
BufFileSet interface from the parallel-hash-v18 patch set. I took the other
parallel-hash patches as a reference to understand the BufFileSet APIs, and
incorporated the changes into parallel CREATE INDEX.
In order to achieve this:
- Applied 0007-Remove-BufFile-s-isTemp-flag.patch and
0008-Add-BufFileSet-for-sharing-temporary-files-between-b.patch from the
parallel-hash-v18.patchset.
- Removed the buffile.c/logtape.c/fd.c changes from the parallel CREATE
INDEX v10 patch.
- Incorporated the BufFileSet API into the parallel tuple sort for CREATE INDEX.
- Changed a few existing functions, and added a few new ones, to support the
BufFileSet changes.
To check the performance, I used a test similar to the one Peter posted
earlier in the thread:
Machine: power2 machine with 512GB of RAM
Setup:
CREATE TABLE parallel_sort_test AS
SELECT hashint8(i) randint,
md5(i::text) collate "C" padding1,
md5(i::text || '2') collate "C" padding2
FROM generate_series(0, 1e9::bigint) i;
vacuum ANALYZE parallel_sort_test;
postgres=# show max_parallel_workers_per_gather;
 max_parallel_workers_per_gather
---------------------------------
8
(1 row)
postgres=# show maintenance_work_mem;
maintenance_work_mem
----------------------
8GB
(1 row)
postgres=# show max_wal_size ;
max_wal_size
--------------
4GB
(1 row)
CREATE INDEX serial_idx ON parallel_sort_test (randint);
Without patch:
Time: 3430054.220 ms (57:10.054)
With patch (max_parallel_workers_maintenance = 8):
Time: 1163445.271 ms (19:23.445)
Thanks to my colleague Thomas Munro for his help and offline discussions
about the patch.
On Wed, Mar 22, 2017 at 10:03 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Mar 21, 2017 at 3:50 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>> I disagree with that. It is a
>> trade-off, I suppose. I have now run out of time to work through it
>> with you or Thomas, though.
>
> Bummer.
I'm going to experiment with refactoring the v10 parallel CREATE INDEX
patch to use the SharedBufFileSet interface from
hj-shared-buf-file-v8.patch today and see what problems I run into.
Attaching v11 patch and trace_sort output for the test.
Thanks,
Rushabh Lathia
Attachment
On Tue, Sep 19, 2017 at 3:21 AM, Rushabh Lathia <rushabh.lathia@gmail.com> wrote: > As per the earlier discussion in the thread, I did experiment using > BufFileSet interface from parallel-hash-v18.patchset. I took the reference > of parallel-hash other patches to understand the BufFileSet APIs, and > incorporate the changes to parallel create index. > > In order to achieve the same: > > - Applied 0007-Remove-BufFile-s-isTemp-flag.patch and > 0008-Add-BufFileSet-for-sharing-temporary-files-between-b.patch from the > parallel-hash-v18.patchset. > - Removed the buffile.c/logtap.c/fd.c changes from the parallel CREATE > INDEX v10 patch. > - incorporate the BufFileSet API to the parallel tuple sort for CREATE > INDEX. > - Changes into few existing functions as well as added few to support the > BufFileSet changes. I'm glad that somebody is working on this. (Someone closer to the more general work on shared/parallel BufFile infrastructure than I am.) I do have some quick feedback, and I hope to be able to provide that to both you and Thomas, as needed to see this one through. I'm not going to get into the tricky details around resource management just yet. I'll start with some simpler questions, to get a general sense of the plan here. I gather that you're at least aware that your v11 of the patch doesn't preserve randomAccess support for parallel sorts, because you didn't include my 0002-* testing GUCs patch, which was specifically designed to make various randomAccess stuff testable. I also figured this to be true because I noticed this FIXME among (otherwise unchanged) tuplesort code: > +static void > +leader_takeover_tapes(Tuplesortstate *state) > +{ > + Sharedsort *shared = state->shared; > + int nLaunched = state->nLaunched; > + int j; > + > + Assert(LEADER(state)); > + Assert(nLaunched >= 1); > + Assert(nLaunched == shared->workersFinished); > + > + /* > + * Create the tapeset from worker tapes, including a leader-owned tape at > + * the end. Parallel workers are far more expensive than logical tapes, > + * so the number of tapes allocated here should never be excessive. FIXME > + */ > + inittapestate(state, nLaunched + 1); > + state->tapeset = LogicalTapeSetCreate(nLaunched + 1, shared->tapes, > + state->fileset, state->worker); It's not surprising to me that you do not yet have this part working, because much of my design was about changing as little as possible above the BufFile interface, in order for tuplesort.c (and logtape.c) code like this to "just work" as if it was the serial case. It doesn't look like you've added the kind of BufFile multiplexing code that I expected to see in logtape.c. This is needed to compensate for the code removed from fd.c and buffile.c. Perhaps it would help me to go look at Thomas' latest parallel hash join patch -- did it gain some kind of transparent multiplexing ability that you actually (want to) use here? Though randomAccess isn't used by CREATE INDEX in general, and so not supporting randomAccess within tuplesort.c for parallel callers doesn't matter as far as this CREATE INDEX user-visible feature is concerned, I still believe that randomAccess is important (IIRC, Robert thought so too). Specifically, it seems like a good idea to have randomAccess support, both on general principle (why should the parallel case be different?), and because having it now will probably enable future enhancements to logtape.c. Enhancements that have it manage parallel sorts based on partitioning/distribution/bucketing [1]. 
I'm pretty sure that partitioning-based parallel sort is going to become very important in the future, especially for parallel GroupAggregate. The leader needs to truly own the tapes it reclaims from workers in order for all of this to work.

Questions on where you're going with randomAccess support:

1. Is randomAccess support a goal for you here at all?

2. If so, is preserving eager recycling of temp file space during randomAccess (materializing a final output tape within the leader) another goal for you here? Do we need to preserve that property of serial external sorts, too, so that it remains true that logtape.c ensures that "the total space usage is essentially just the actual data volume, plus insignificant bookkeeping and start/stop overhead"? (I'm quoting from master's logtape.c header comments.)

3. Any ideas on next steps in support of those 2 goals? What problems do you foresee, if any?

> CREATE INDEX serial_idx ON parallel_sort_test (randint);
>
> Without patch:
>
> Time: 3430054.220 ms (57:10.054)
>
> With patch (max_parallel_workers_maintenance = 8):
>
> Time: 1163445.271 ms (19:23.445)

This looks very similar to my v10. While I will need to follow up on this, to make sure, it seems likely that this patch has exactly the same performance characteristics as v10.

Thanks

[1] https://wiki.postgresql.org/wiki/Parallel_External_Sort#Partitioning_for_parallelism_.28parallel_external_sort_beyond_CREATE_INDEX.29

-- Peter Geoghegan
On Wed, Sep 20, 2017 at 5:17 AM, Peter Geoghegan <pg@bowt.ie> wrote:
On Tue, Sep 19, 2017 at 3:21 AM, Rushabh Lathia
<rushabh.lathia@gmail.com> wrote:
> As per the earlier discussion in the thread, I experimented with using the
> BufFileSet interface from the parallel-hash-v18 patch set. I took the other
> parallel-hash patches as a reference to understand the BufFileSet APIs, and
> incorporated the changes into parallel CREATE INDEX.
>
> In order to achieve this:
>
> - Applied 0007-Remove-BufFile-s-isTemp-flag.patch and
> 0008-Add-BufFileSet-for-sharing-temporary-files-between-b.patch from the
> parallel-hash-v18.patchset.
> - Removed the buffile.c/logtape.c/fd.c changes from the parallel CREATE
> INDEX v10 patch.
> - Incorporated the BufFileSet API into the parallel tuple sort for CREATE
> INDEX.
> - Changed a few existing functions, and added a few new ones, to support the
> BufFileSet changes.
I'm glad that somebody is working on this. (Someone closer to the more
general work on shared/parallel BufFile infrastructure than I am.)
I do have some quick feedback, and I hope to be able to provide that
to both you and Thomas, as needed to see this one through. I'm not
going to get into the tricky details around resource management just
yet. I'll start with some simpler questions, to get a general sense of
the plan here.
Thanks Peter.
I gather that you're at least aware that your v11 of the patch doesn't
preserve randomAccess support for parallel sorts, because you didn't
include my 0002-* testing GUCs patch, which was specifically designed
to make various randomAccess stuff testable. I also figured this to be
true because I noticed this FIXME among (otherwise unchanged)
tuplesort code:
Yes, I haven't touched the randomAccess part yet. My initial goal was
to incorporate the BufFileSet APIs here.
> +static void
> +leader_takeover_tapes(Tuplesortstate *state)
> +{
> + Sharedsort *shared = state->shared;
> + int nLaunched = state->nLaunched;
> + int j;
> +
> + Assert(LEADER(state));
> + Assert(nLaunched >= 1);
> + Assert(nLaunched == shared->workersFinished);
> +
> + /*
> + * Create the tapeset from worker tapes, including a leader-owned tape at
> + * the end. Parallel workers are far more expensive than logical tapes,
> + * so the number of tapes allocated here should never be excessive. FIXME
> + */
> + inittapestate(state, nLaunched + 1);
> + state->tapeset = LogicalTapeSetCreate(nLaunched + 1, shared->tapes,
> + state->fileset, state->worker);
It's not surprising to me that you do not yet have this part working,
because much of my design was about changing as little as possible
above the BufFile interface, in order for tuplesort.c (and logtape.c)
code like this to "just work" as if it was the serial case.
Right. I just followed your design from your earlier patches.
It doesn't
look like you've added the kind of BufFile multiplexing code that I
expected to see in logtape.c. This is needed to compensate for the
code removed from fd.c and buffile.c. Perhaps it would help me to go
look at Thomas' latest parallel hash join patch -- did it gain some
kind of transparent multiplexing ability that you actually (want to)
use here?
Sorry, I didn't get this part. Are you talking about your patch's changes
to OpenTemporaryFileInTablespace(), BufFileUnify() and the other changes
related to ltsUnify()? If that's the case, I don't think that's required with
the BufFileSet. Correct me if I am wrong here.
Though randomAccess isn't used by CREATE INDEX in general, and so not
supporting randomAccess within tuplesort.c for parallel callers
doesn't matter as far as this CREATE INDEX user-visible feature is
concerned, I still believe that randomAccess is important (IIRC,
Robert thought so too). Specifically, it seems like a good idea to
have randomAccess support, both on general principle (why should the
parallel case be different?), and because having it now will probably
enable future enhancements to logtape.c. Enhancements that have it
manage parallel sorts based on partitioning/distribution/bucketing
[1]. I'm pretty sure that partitioning-based parallel sort is going to
become very important in the future, especially for parallel
GroupAggregate. The leader needs to truly own the tapes it reclaims
from workers in order for all of this to work.
The first application for the tuplesort here is CREATE INDEX, and that doesn't
need randomAccess. But, as you said and as has been discussed in the thread,
randomAccess is important, and we should certainly put in the effort to
support it.
Questions on where you're going with randomAccess support:
1. Is randomAccess support a goal for you here at all?
2. If so, is preserving eager recycling of temp file space during
randomAccess (materializing a final output tape within the leader)
another goal for you here? Do we need to preserve that property of
serial external sorts, too, so that it remains true that logtape.c
ensures that "the total space usage is essentially just the actual
data volume, plus insignificant bookkeeping and start/stop overhead"?
(I'm quoting from master's logtape.c header comments.)
3. Any ideas on next steps in support of those 2 goals? What problems
do you foresee, if any?
To be frank, it's too early for me to comment on anything in this area. I need
to study this more closely. As an initial goal I was just focused on
understanding the current implementation of the patch and incorporating
the BufFileSet APIs.
> CREATE INDEX serial_idx ON parallel_sort_test (randint);
>
> Without patch:
>
> Time: 3430054.220 ms (57:10.054)
>
> With patch (max_parallel_workers_maintenance = 8):
>
> Time: 1163445.271 ms (19:23.445)
This looks very similar to my v10. While I will need to follow up on
this, to make sure, it seems likely that this patch has exactly the
same performance characteristics as v10.
It's 2.96x, more or less similar to your v10. Any difference might be due
to the different testing environment.
Thanks
[1] https://wiki.postgresql.org/wiki/Parallel_External_Sort#Partitioning_for_parallelism_.28parallel_external_sort_beyond_CREATE_INDEX.29
--
Peter Geoghegan
Thanks,
Rushabh Lathia
On Wed, Sep 20, 2017 at 5:32 AM, Rushabh Lathia <rushabh.lathia@gmail.com> wrote:
> The first application for the tuplesort here is CREATE INDEX, and that doesn't
> need randomAccess. But, as you said and as has been discussed in the thread,
> randomAccess is important, and we should certainly put in the effort to
> support it.

There's no direct benefit of working on randomAccess support unless we have some code that wants to use that support for something. Indeed, it would just leave us with code we couldn't test. While I do agree that there are probably use cases for randomAccess, I think what we should do right now is try to get this patch reviewed and committed so that we have parallel CREATE INDEX for btree indexes. And in so doing, let's keep it as simple as possible. Parallel CREATE INDEX for btree indexes is a great feature without adding any more complexity.

Later, anybody who wants to work on randomAccess support -- and whatever planner and executor changes are needed to make effective use of it -- can do so. For example, one can imagine a plan like this:

Gather
-> Merge Join
   -> Parallel Index Scan
   -> Parallel Sort
      -> Parallel Seq Scan

If the parallel sort reads out all of the output in every worker, then it becomes legal to do this kind of thing -- it would end up, I think, being quite similar to Parallel Hash. However, there's some question in my mind as to whether we want to do this or, say, hash-partition both relations and then perform separate joins on each partition. The above plan is clearly better than what we can do today, where every worker would have to repeat the sort, ugh, but I don't know if it's the best plan. Fortunately, to get this patch committed, we don't have to figure that out.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Sep 20, 2017 at 2:32 AM, Rushabh Lathia <rushabh.lathia@gmail.com> wrote:
> Yes, I haven't touched the randomAccess part yet. My initial goal was
> to incorporate the BufFileSet APIs here.

This is going to need a rebase, due to the commit today to remove replacement selection sort. That much should be easy.

> Sorry, I didn't get this part. Are you talking about your patch's changes
> to OpenTemporaryFileInTablespace(), BufFileUnify() and the other changes
> related to ltsUnify()? If that's the case, I don't think that's required with
> the BufFileSet. Correct me if I am wrong here.

I thought that you'd have multiple BufFiles, which would be multiplexed (much like a single BufFile itself multiplexes 1GB segments), so that logtape.c could still recycle space in the randomAccess case. I guess that that's not a goal now.

> To be frank, it's too early for me to comment on anything in this area. I need
> to study this more closely. As an initial goal I was just focused on
> understanding the current implementation of the patch and incorporating
> the BufFileSet APIs.

Fair enough.

-- Peter Geoghegan
On Sat, Sep 30, 2017 at 5:06 AM, Peter Geoghegan <pg@bowt.ie> wrote:
On Wed, Sep 20, 2017 at 2:32 AM, Rushabh Lathia
<rushabh.lathia@gmail.com> wrote:
> Yes, I haven't touched the randomAccess part yet. My initial goal was
> to incorporate the BufFileSet APIs here.
This is going to need a rebase, due to the commit today to remove
replacement selection sort. That much should be easy.
Sorry for the delay; here is the rebased version of the patch.
> Sorry, I didn't get this part. Are you talking about your patch's changes
> to OpenTemporaryFileInTablespace(), BufFileUnify() and the other changes
> related to ltsUnify()? If that's the case, I don't think that's required with
> the BufFileSet. Correct me if I am wrong here.
I thought that you'd have multiple BufFiles, which would be
multiplexed (much like a single BufFile itself multiplexes 1GB
segments), so that logtape.c could still recycle space in the
randomAccess case. I guess that that's not a goal now.
Hmm okay.
> To be frank, it's too early for me to comment on anything in this area. I need
> to study this more closely. As an initial goal I was just focused on
> understanding the current implementation of the patch and incorporating
> the BufFileSet APIs.
Fair enough.
Thanks,
--
Rushabh Lathia
Attachment
Attaching the rebased patch according to the v22 parallel-hash patch sets.
Thanks,
--
Rushabh Lathia
Attachment
On Thu, Oct 26, 2017 at 4:22 AM, Rushabh Lathia <rushabh.lathia@gmail.com> wrote:
> Attaching the rebased patch according to the v22 parallel-hash patch sets.

I took a quick look at this today, and noticed a few issues:

* make_name() is used to name files in sharedtuplestore.c, which is what is passed to BufFileOpenShared() for parallel hash join. You're using your own logic for that within the equivalent logtape.c call to BufFileOpenShared(), presumably because make_name() wants to identify participants by PID rather than by an ordinal identifier number.

I think that we need some kind of central registry for things that use shared buffiles. It could be that sharedtuplestore.c is further generalized to support this, or it could be that they both call something else that takes care of naming. It's not okay to have this left to random chance.

You're going to have to ask Thomas about this. You should also use MAXPGPATH for the char buffer on the stack.

* This logtape.c comment needs to be updated, as it's no longer true:

 * successfully. In general, workers can take it that the leader will
 * reclaim space in files under their ownership, and so should not
 * reread from tape.

* Robert hated the comment changes in the header of nbtsort.c. You might want to change it back, because he is likely to be the one that commits this.

* You should look for similar comments in tuplesort.c (IIRC a couple of places will need to be revised).

* tuplesort_begin_common() should actively reject a randomAccess parallel case using elog(ERROR).

* tuplesort.h should note that randomAccess isn't supported, too.

* What's this all about?:

+ /* Accessor for the SharedBufFileSet that is at the end of Sharedsort. */
+ #define GetSharedBufFileSet(shared) \
+ ((BufFileSet *) (&(shared)->tapes[(shared)->nTapes]))

You can't just cast from one type to the other without regard for the underlying size of the shared memory buffer, which is what this looks like to me. This only fails to crash because you're only abusing the last member in the tapes array for this purpose, and there happens to be enough shared memory slop that you get away with it. I'm pretty sure that ltsUnify() ends up clobbering the last/leader tape, which is a place where BufFileSet is also used, so this is just wrong. You should rethink the shmem structure a little bit.

* There is still that FIXME comment within leader_takeover_tapes(). I believe that you should still have a leader tape (at least in local memory in the leader), even though you'll never be able to do anything with it, since randomAccess is no longer supported. You can remove the FIXME, and just note that you have a leader tape to be consistent with the serial case, though recognize that it's not useful. Note that even with randomAccess, we always had the leader tape, so it's not that different, really.

I suppose it might make sense to make shared->tapes not have a leader tape. It hardly matters -- perhaps you should leave it there in order to keep the code simple, as you'll be keeping the leader tape in local memory, too. (But it still won't fly to continue to clobber it, of course -- you still need to find a dedicated place for BufFileSet in shared memory.)

That's all I have right now.

-- Peter Geoghegan
On Wed, Nov 1, 2017 at 11:29 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Thu, Oct 26, 2017 at 4:22 AM, Rushabh Lathia
> <rushabh.lathia@gmail.com> wrote:
>> Attaching the rebased patch according to the v22 parallel-hash patch sets.
>
> I took a quick look at this today, and noticed a few issues:
>
> * make_name() is used to name files in sharedtuplestore.c, which is
> what is passed to BufFileOpenShared() for parallel hash join. You're
> using your own logic for that within the equivalent logtape.c call to
> BufFileOpenShared(), presumably because make_name() wants to identify
> participants by PID rather than by an ordinal identifier number.

So that's this bit:

+ pg_itoa(worker, filename);
+ lts->pfile = BufFileCreateShared(fileset, filename);

... and:

+ pg_itoa(i, filename);
+ file = BufFileOpenShared(fileset, filename);

What's wrong with using a worker number like this?

> I think that we need some kind of central registry for things that use
> shared buffiles. It could be that sharedtuplestore.c is further
> generalized to support this, or it could be that they both call
> something else that takes care of naming. It's not okay to have this
> left to random chance.

It's not random choice: buffile.c creates a uniquely named directory (or directories, if you have more than one location configured in the temp_tablespaces GUC) to hold all the backing files involved in each BufFileSet. Naming of BufFiles within the BufFileSet is the caller's problem, and a worker number seems like a reasonable choice to me. It won't collide with a concurrent parallel CREATE INDEX because that'll be using its own BufFileSet.

> You're going to have to ask Thomas about this. You should also use
> MAXPGPATH for the char buffer on the stack.

Here's a summary of the namespace management scheme I currently have at the three layers fd.c, buffile.c, and sharedtuplestore.c:

1. fd.c has new lower-level functions PathNameCreateTemporaryFile(const char *path) and PathNameOpenTemporaryFile(const char *path). It also provides PathNameCreateTemporaryDir(). Clearly callers of these interfaces will need to be very careful about managing the names they use. Callers also own the problem of cleaning up files, since there is no automatic cleanup of files created this way. My intention was that these facilities would *only* be used by BufFileSet, since it has machinery to manage those things.

2. buffile.c introduces BufFileSet, which is conceptually a set of BufFiles that can be shared by multiple backends with DSM segment-scoped cleanup. It is implemented as a set of directories: one for each tablespace in temp_tablespaces. It controls the naming of those directories. The BufFileSet directories are named similarly to fd.c's traditional temporary file names using the usual recipe "pgsql_tmp" + PID + per-process counter but have an additional ".set" suffix. RemovePgTempFilesInDir() recognises directories with that prefix and suffix as junk left over from a crash when cleaning up. I suppose it's that knowledge about reserved name patterns and cleanup that you are thinking of as a central registry?

As for the BufFiles that are in a BufFileSet, buffile.c has no opinion on that: the calling code (parallel CREATE INDEX, sharedtuplestore.c, ...) is responsible for coming up with its own scheme. If parallel CREATE INDEX wants to name shared BufFiles "walrus" and "banana", that's OK by me, and those files won't collide with anything in another BufFileSet because each BufFileSet has its own directory (-ies).
One complaint about the current coding that someone might object to: MakeSharedSegmentPath() just dumps the caller's BufFile name into a path without sanitisation; I should fix that so that we only accept fairly limited strings here. Another complaint is that perhaps fd.c knows too much about buffile.c's business. For example, RemovePgTempFilesInDir() knows about the ".set" directories created by buffile.c, which might be called a layering violation. Perhaps the set/directory logic should move entirely into fd.c, so you'd call FileSetInit(FileSet *), not BufFileSetInit(BufFileSet *), and then BufFileOpenShared() would take a FileSet *, not a BufFileSet *. Thoughts?

3. sharedtuplestore.c takes a caller-supplied BufFileSet and creates its shared BufFiles in there. Earlier versions created and owned a BufFileSet, but in the current Parallel Hash patch I create loads of separate SharedTuplestore objects, but I didn't want to create loads of directories to back them, so you can give them all the same BufFileSet. That works because SharedTuplestores are also given a name, and it's the caller's job (in my case nodeHash.c) to make sure the SharedTuplestores are given unique names within the same BufFileSet. For Parallel Hash you'll see names like 'i3of8' (inner batch 3 of 8). There is no need for any sort of central registry for that though, because it rides on top of the guarantees from 2 above: buffile.c will put those files into a uniquely named directory, and that works as long as no one else is allowed to create files or directories in the temp directory that collide with its reserved pattern /^pgsql_tmp.+\.set$/. For the same reason, parallel CREATE INDEX is free to use worker numbers as BufFile names, since it has its own BufFileSet to work within.

> * What's this all about?:
>
> + /* Accessor for the SharedBufFileSet that is at the end of Sharedsort. */
> + #define GetSharedBufFileSet(shared) \
> + ((BufFileSet *) (&(shared)->tapes[(shared)->nTapes]))

In an earlier version, BufFileSet was one of those annoying data structures with a FLEXIBLE_ARRAY_MEMBER that you'd use as an incomplete type (declared but not defined in the includable header), and here it was being used "inside" (or rather after) SharedSort, which *itself* had a FLEXIBLE_ARRAY_MEMBER. The reason for the variable sized object was that I needed all backends to agree on the set of temporary tablespace OIDs, of which there could be any number, but I also needed a 'flat' (pointer-free) object I could stick in relocatable shared memory. In the newest version I changed that flexible array to tablespaces[8], because 8 should be enough tablespaces for anyone (TM). I don't really believe anyone uses temp_tablespaces for IO load balancing anymore and I hate code like the above. So I think Rushabh should now remove the above-quoted code and just use a BufFileSet directly as a member of SharedSort.

-- Thomas Munro http://www.enterprisedb.com
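To make the reserved-name recipe above concrete, here is a sketch of how such a directory name could be built ("pgsql_tmp" + PID + per-process counter + ".set"). The helper name and exact format string are hypothetical; the actual patch may format things differently:

    #include "postgres.h"
    #include "miscadmin.h"      /* MyProcPid */
    #include "storage/fd.h"     /* PG_TEMP_FILE_PREFIX, i.e. "pgsql_tmp" */

    /*
     * Sketch only: build the name of a BufFileSet directory inside a given
     * tablespace's temporary directory.  "tempdirpath" is assumed to already
     * name the pgsql_tmp directory for the tablespace.
     */
    static void
    MakeSetDirName(char *path, size_t pathlen, const char *tempdirpath)
    {
        static uint32 counter = 0;

        /* e.g. ".../pgsql_tmp/pgsql_tmp12345.0.set" */
        snprintf(path, pathlen, "%s/%s%d.%u.set",
                 tempdirpath, PG_TEMP_FILE_PREFIX, MyProcPid, counter++);
    }

RemovePgTempFilesInDir() can then treat anything matching the reserved pattern /^pgsql_tmp.+\.set$/ as crash leftovers, exactly as described above.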
On Tue, Oct 31, 2017 at 5:07 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> So that's this bit:
>
> + pg_itoa(worker, filename);
> + lts->pfile = BufFileCreateShared(fileset, filename);
>
> ... and:
>
> + pg_itoa(i, filename);
> + file = BufFileOpenShared(fileset, filename);

Right.

> What's wrong with using a worker number like this?

I guess nothing, though there is the question of discoverability for DBAs, etc. You do address this separately, by having (potentially) descriptive filenames, as you go into.

> It's not random choice: buffile.c creates a uniquely named directory
> (or directories, if you have more than one location configured in the
> temp_tablespaces GUC) to hold all the backing files involved in each
> BufFileSet. Naming of BufFiles within the BufFileSet is the caller's
> problem, and a worker number seems like a reasonable choice to me. It
> won't collide with a concurrent parallel CREATE INDEX because that'll
> be using its own BufFileSet.

Oh, I see. I may have jumped the gun on that one.

> One complaint about the current coding that someone might object to:
> MakeSharedSegmentPath() just dumps the caller's BufFile name into a
> path without sanitisation; I should fix that so that we only accept
> fairly limited strings here. Another complaint is that perhaps fd.c
> knows too much about buffile.c's business. For example,
> RemovePgTempFilesInDir() knows about the ".set" directories created by
> buffile.c, which might be called a layering violation. Perhaps the
> set/directory logic should move entirely into fd.c, so you'd call
> FileSetInit(FileSet *), not BufFileSetInit(BufFileSet *), and then
> BufFileOpenShared() would take a FileSet *, not a BufFileSet *.
> Thoughts?

I'm going to make an item on my personal TODO list for that. No useful insights on that right now, though.

> 3. sharedtuplestore.c takes a caller-supplied BufFileSet and creates
> its shared BufFiles in there. Earlier versions created and owned a
> BufFileSet, but in the current Parallel Hash patch I create loads of
> separate SharedTuplestore objects, but I didn't want to create loads
> of directories to back them, so you can give them all the same
> BufFileSet. That works because SharedTuplestores are also given a
> name, and it's the caller's job (in my case nodeHash.c) to make sure
> the SharedTuplestores are given unique names within the same
> BufFileSet. For Parallel Hash you'll see names like 'i3of8' (inner
> batch 3 of 8). There is no need for any sort of central registry for
> that though, because it rides on top of the guarantees from 2 above:
> buffile.c will put those files into a uniquely named directory, and
> that works as long as no one else is allowed to create files or
> directories in the temp directory that collide with its reserved
> pattern /^pgsql_tmp.+\.set$/. For the same reason, parallel CREATE
> INDEX is free to use worker numbers as BufFile names, since it has
> its own BufFileSet to work within.

If the new standard is that you have temp file names that suggest the purpose of each temp file, then that may be something that parallel CREATE INDEX should buy into.

> In an earlier version, BufFileSet was one of those annoying data
> structures with a FLEXIBLE_ARRAY_MEMBER that you'd use as an
> incomplete type (declared but not defined in the includable header),
> and here it was being used "inside" (or rather after) SharedSort,
> which *itself* had a FLEXIBLE_ARRAY_MEMBER.
> The reason for the variable sized object was that I needed all
> backends to agree on the set of temporary tablespace OIDs, of which
> there could be any number, but I also needed a 'flat' (pointer-free)
> object I could stick in relocatable shared memory. In the newest
> version I changed that flexible array to tablespaces[8], because 8
> should be enough tablespaces for anyone (TM).

I guess that that's something that you'll need to take up with Andres, if you haven't already. I have a hard time imagining a single query needing to use more than that many tablespaces at once, so maybe this is fine.

> I don't really believe anyone uses
> temp_tablespaces for IO load balancing anymore and I hate code like
> the above. So I think Rushabh should now remove the above-quoted code
> and just use a BufFileSet directly as a member of SharedSort.

FWIW, I agree with you that nobody uses temp_tablespaces this way these days. This seems like a discussion for your hash join patch, though. I'm happy to buy into that.

-- Peter Geoghegan
On Wed, Nov 1, 2017 at 2:11 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Tue, Oct 31, 2017 at 5:07 PM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> Another complaint is that perhaps fd.c
>> knows too much about buffile.c's business. For example,
>> RemovePgTempFilesInDir() knows about the ".set" directories created by
>> buffile.c, which might be called a layering violation. Perhaps the
>> set/directory logic should move entirely into fd.c, so you'd call
>> FileSetInit(FileSet *), not BufFileSetInit(BufFileSet *), and then
>> BufFileOpenShared() would take a FileSet *, not a BufFileSet *.
>> Thoughts?
>
> I'm going to make an item on my personal TODO list for that. No useful
> insights on that right now, though.

I decided to try that, but it didn't really work: fd.h gets included by front-end code, so I can't very well define a struct and declare functions that deal in dsm_segment and slock_t. On the other hand it does seem a bit better for these shared file sets to work in terms of File, not BufFile. That way you don't have to opt in to BufFile's double buffering and segmentation schemes just to get shared file clean-up, if for some reason you want direct file handles. So in the v24 parallel hash patch set I just posted over in the other thread, I have moved it into its own translation unit sharedfileset.c and made it work with File objects. buffile.c knows how to use it as a source of segment files. I think that's better.

> If the new standard is that you have temp file names that suggest the
> purpose of each temp file, then that may be something that parallel
> CREATE INDEX should buy into.

Yeah, I guess that could be useful.

-- Thomas Munro http://www.enterprisedb.com
Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>> I'm going to make an item on my personal TODO list for that. No useful
>> insights on that right now, though.
>
> I decided to try that, but it didn't really work: fd.h gets included
> by front-end code, so I can't very well define a struct and declare
> functions that deal in dsm_segment and slock_t. On the other hand it
> does seem a bit better for these shared file sets to work in terms
> of File, not BufFile.

Realistically, fd.h has a number of functions that are really owned by buffile.c already. This sounds fine.

> That way you don't have to opt in to BufFile's
> double buffering and segmentation schemes just to get shared file
> clean-up, if for some reason you want direct file handles.

Is that something that you really think is possible?

-- Peter Geoghegan
On Fri, Nov 3, 2017 at 2:24 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>> That way you don't have to opt in to BufFile's
>> double buffering and segmentation schemes just to get shared file
>> clean-up, if for some reason you want direct file handles.
>
> Is that something that you really think is possible?

It's pretty far-fetched, but maybe shared temporary relation files accessed via smgr.c/md.c? Or maybe future things that don't want to read/write through a buffer but instead want to mmap it.

-- Thomas Munro http://www.enterprisedb.com
Thanks Peter and Thomas for the review comments.
On Wed, Nov 1, 2017 at 3:59 AM, Peter Geoghegan <pg@bowt.ie> wrote:
On Thu, Oct 26, 2017 at 4:22 AM, Rushabh Lathia
<rushabh.lathia@gmail.com> wrote:
> Attaching the re based patch according to the v22 parallel-hash patch sets
I took a quick look at this today, and noticed a few issues:
* make_name() is used to name files in sharedtuplestore.c, which is
what is passed to BufFileOpenShared() for parallel hash join. You're
using your own logic for that within the equivalent logtape.c call to
BufFileOpenShared(), presumably because make_name() wants to identify
participants by PID rather than by an ordinal identifier number.
I think that we need some kind of central registry for things that use
shared buffiles. It could be that sharedtuplestore.c is further
generalized to support this, or it could be that they both call
something else that takes care of naming. It's not okay to have this
left to random chance.
You're going to have to ask Thomas about this. You should also use
MAXPGPATH for the char buffer on the stack.
Used MAXPGPATH for the char buffer.
* This logtape.c comment needs to be updated, as it's no longer true:
* successfully. In general, workers can take it that the leader will
* reclaim space in files under their ownership, and so should not
* reread from tape.
Done.
* Robert hated the comment changes in the header of nbtsort.c. You
might want to change it back, because he is likely to be the one that
commits this.
* You should look for similar comments in tuplesort.c (IIRC a couple
of places will need to be revised).
Pending.
* tuplesort_begin_common() should actively reject a randomAccess
parallel case using elog(ERROR).
Done.
* tuplesort.h should note that randomAccess isn't supported, too.
Done.
* What's this all about?:
+ /* Accessor for the SharedBufFileSet that is at the end of Sharedsort. */
+ #define GetSharedBufFileSet(shared) \
+ ((BufFileSet *) (&(shared)->tapes[(shared)->nTapes]))
You can't just cast from one type to the other without regard for the
underlying size of the shared memory buffer, which is what this looks
like to me. This only fails to crash because you're only abusing the
last member in the tapes array for this purpose, and there happens to
be enough shared memory slop that you get away with it. I'm pretty
sure that ltsUnify() ends up clobbering the last/leader tape, which is
a place where BufFileSet is also used, so this is just wrong. You
should rethink the shmem structure a little bit.
Fixed this by adding a SharedFileSet directly into the Sharedsort struct.
Thanks Thomas Munro for the offline help here.
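For reference, a sketch of the resulting shared memory struct, with the SharedFileSet as an ordinary fixed-size member rather than something cast out of the tail of the tapes array. The fields other than fileset and tapes are abridged, and member types are only assumed here:

    typedef struct Sharedsort
    {
        /* ... mutex, workersFinished, and other coordination state ... */
        SharedFileSet fileset;      /* space for temp files -- fixed size */
        int         nTapes;
        /* the variable-length array must remain the last member */
        TapeShare   tapes[FLEXIBLE_ARRAY_MEMBER];
    } Sharedsort;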
* There is still that FIXME comment within leader_takeover_tapes(). I
believe that you should still have a leader tape (at least in local
memory in the leader), even though you'll never be able to do anything
with it, since randomAccess is no longer supported. You can remove the
FIXME, and just note that you have a leader tape to be consistent with
the serial case, though recognize that it's not useful. Note that even
with randomAccess, we always had the leader tape, so it's not that
different, really.
Done.
I suppose it might make sense to make shared->tapes not have a leader
tape. It hardly matters -- perhaps you should leave it there in order
to keep the code simple, as you'll be keeping the leader tape in local
memory, too. (But it still won't fly to continue to clobber it, of
course -- you still need to find a dedicated place for BufFileSet in
shared memory.)
Attaching the latest patch (v13) here. I will continue working on the comment
improvements for nbtsort.c and tuplesort.c, and will also perform more testing
with the attached patch.
The patch is a rebase onto the v25 patch set of Parallel Hash.
Thanks,
Rushabh Lathia
Attachment
On Tue, Nov 14, 2017 at 1:41 AM, Rushabh Lathia <rushabh.lathia@gmail.com> wrote: > Thanks Peter and Thomas for the review comments. No problem. More feedback: * I don't really see much need for this: + elog(LOG, "Worker for create index %d", parallel_workers); You can just use trace_sort, and observe the actual behavior of the sort that way. * As I said before, you should remove the header comments within nbtsort.c. * This should just say "write routines": + * This is why write/recycle routines don't need to know about offsets at + * all. * You didn't point out the randomAccess restriction in tuplesort.h. * I can't remember why I added the Valgrind suppression at this point. I'd remove it until the reason becomes clear, which may never happen. The regression tests should still pass without Valgrind warnings. * You can add back comments removed from above LogicalTapeTell(). I made these changes because it looked like we should close out the possibility of doing a tell during the write phase, as unified tapes actually would make that hard (no one does what it describes anyway). But now, unified tapes are a distinct case to frozen tapes in a way that they weren't before, so there is no need to make it impossible. I also think you should replace "Assert(lt->frozen)" with "Assert(lt->offsetBlockNumber == 0L)", for the same reason. -- Peter Geoghegan
On Tue, Nov 14, 2017 at 10:01 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Tue, Nov 14, 2017 at 1:41 AM, Rushabh Lathia
> <rushabh.lathia@gmail.com> wrote:
>> Thanks Peter and Thomas for the review comments.
>
> No problem. More feedback:

I see that Robert just committed support for a parallel_leader_participation GUC. Parallel tuplesort should use this, too.

It will be easy to adapt the patch to make this work. Just change the code within nbtsort.c to respect parallel_leader_participation, rather than leaving that as a compile-time switch. Remove the force_single_worker variable, and use !parallel_leader_participation in its place.

The parallel_leader_participation docs will also need to be updated.

-- Peter Geoghegan
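A sketch of what that nbtsort.c change might look like, with all variable names merely illustrative rather than taken from the patch:

    /* Read the GUC once; honor it in place of the old compile-time switch. */
    bool        leaderparticipates = parallel_leader_participation;

    /* The leader counts as a sort participant only when it opts in. */
    btshared->scantuplesortstates =
        leaderparticipates ? nworkers_launched + 1 : nworkers_launched;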
On Tue, Nov 14, 2017 at 11:31 PM, Peter Geoghegan <pg@bowt.ie> wrote:
On Tue, Nov 14, 2017 at 1:41 AM, Rushabh Lathia
<rushabh.lathia@gmail.com> wrote:
> Thanks Peter and Thomas for the review comments.
No problem. More feedback:
* I don't really see much need for this:
+ elog(LOG, "Worker for create index %d", parallel_workers);
You can just use trace_sort, and observe the actual behavior of the
sort that way.
Right, that was just added for testing purposes. Removed in the
latest version of the patch.
* As I said before, you should remove the header comments within nbtsort.c.
Done.
* This should just say "write routines":
+ * This is why write/recycle routines don't need to know about offsets at
+ * all.
Okay, done.
* You didn't point out the randomAccess restriction in tuplesort.h.
I did, it's there in the file header comments.
* I can't remember why I added the Valgrind suppression at this point.
I'd remove it until the reason becomes clear, which may never happen.
The regression tests should still pass without Valgrind warnings.
Make sense.
* You can add back comments removed from above LogicalTapeTell(). I
made these changes because it looked like we should close out the
possibility of doing a tell during the write phase, as unified tapes
actually would make that hard (no one does what it describes anyway).
But now, unified tapes are a distinct case to frozen tapes in a way
that they weren't before, so there is no need to make it impossible.
I also think you should replace "Assert(lt->frozen)" with
"Assert(lt->offsetBlockNumber == 0L)", for the same reason.
Yep, done.
I see that Robert just committed support for a
parallel_leader_participation GUC. Parallel tuplesort should use this,
too.
It will be easy to adapt the patch to make this work. Just change the
code within nbtsort.c to respect parallel_leader_participation, rather
than leaving that as a compile-time switch. Remove the
force_single_worker variable, and use !parallel_leader_participation
in its place.
Added handling for parallel_leader_participation, and deleted the
compile-time option force_single_worker.
The parallel_leader_participation docs will also need to be updated.
Done.
Also performed more testing with the patch, with parallel_leader_participation
ON and OFF. Found one issue: earlier we always used to call
_bt_leader_sort_as_worker(), but now we need to skip the call if
parallel_leader_participation is OFF.
Also fixed the documentation, including a documentation compilation error.
PFA v14 patch.
...
...
Thanks,
Rushabh Lathia
Attachment
On Thu, Dec 7, 2017 at 12:25 AM, Rushabh Lathia <rushabh.lathia@gmail.com> wrote:
> 0001-Add-parallel-B-tree-index-build-sorting_v14.patch

Cool. I'm glad that we now have a patch that applies cleanly against master, while adding very little to buffile.c. It feels like we're getting very close here.

>> * You didn't point out the randomAccess restriction in tuplesort.h.
>
> I did, it's there in the file header comments.

I see what you wrote in tuplesort.h here:

> + * algorithm, and are typically only used for large amounts of data. Note
> + * that parallel sorts is not support for random access to the sort result.

This should say "...are not supported when random access is requested".

> Added handling for parallel_leader_participation, and deleted the
> compile-time option force_single_worker.

I still see this:

> +
> +/*
> + * A parallel sort with one worker process, and without any leader-as-worker
> + * state may be used for testing the parallel tuplesort infrastructure.
> + */
> +#ifdef NOT_USED
> +#define FORCE_SINGLE_WORKER
> +#endif

Looks like you missed this FORCE_SINGLE_WORKER hunk -- please remove it, too.

>> The parallel_leader_participation docs will also need to be updated.
>
> Done.

I don't see this. There is no reference to parallel_leader_participation in the CREATE INDEX docs, nor is there a reference to CREATE INDEX in the parallel_leader_participation docs.

> Also performed more testing with the patch, with parallel_leader_participation
> ON and OFF. Found one issue: earlier we always used to call
> _bt_leader_sort_as_worker(), but now we need to skip the call if
> parallel_leader_participation is OFF.

Hmm. I think the local variable within _bt_heapscan() should go back. Its value should be directly taken from parallel_leader_participation assignment, once. There might be some bizarre circumstances where it is possible for the value of parallel_leader_participation to change in flight, causing a race condition: we start with the leader as a participant, and change our mind later within _bt_leader_sort_as_worker(), causing the whole CREATE INDEX to hang forever. Even if that's impossible, it seems like an improvement in style to go back to one local variable controlling everything.

Style issue here:

> + long start_block = file->numFiles * BUFFILE_SEG_SIZE;
> + int newNumFiles = file->numFiles + source->numFiles;

Shouldn't start_block conform to the surrounding camelCase style?

Finally, two new thoughts on the patch, that are not responses to anything you did in v14:

1. Thomas' barrier abstraction was added by commit 1145acc7. I think that you should use a static barrier in tuplesort.c now, and rip out the ConditionVariable fields in the Sharedsort struct. It's only a slightly higher level of abstraction for tuplesort.c, which makes only a small difference given the simple requirements of tuplesort.c. However, I see no reason to not go that way if that's the new standard, which it is. This looks like it will be fairly easy.

2. Does the plan_create_index_workers() cost model need to account for parallel_leader_participation, too, when capping workers? I think that it does. The relevant planner code is:

> + /*
> + * Cap workers based on available maintenance_work_mem as needed.
> + *
> + * Note that each tuplesort participant receives an even share of the
> + * total maintenance_work_mem budget. Aim to leave workers (where
> + * leader-as-worker Tuplesortstate counts as a worker) with no less than
> + * 32MB of memory. This leaves cases where maintenance_work_mem is set to
> + * 64MB immediately past the threshold of being capable of launching a
> + * single parallel worker to sort.
> + */
> + sort_mem_blocks = (maintenance_work_mem * 1024L) / BLCKSZ;
> + min_sort_mem_blocks = (32768L * 1024L) / BLCKSZ;
> + while (parallel_workers > min_parallel_workers &&
> + sort_mem_blocks / (parallel_workers + 1) < min_sort_mem_blocks)
> + parallel_workers--;

This parallel CREATE INDEX planner code snippet is about the need to have low per-worker maintenance_work_mem availability prevent more parallel workers from being added to the number that we plan to launch. Each worker tuplesort state needs at least 32MB. We clearly need to do something here.

While it's always true that "leader-as-worker Tuplesortstate counts as a worker" in v14, I think that it should only be true in the next revision of the patch when parallel_leader_participation is actually true (IOW, we should only add 1 to parallel_workers within the loop invariant in that case). The reason why we need to consider parallel_leader_participation within this plan_create_index_workers() code is simple: During execution, _bt_leader_sort_as_worker() uses "worker tuplesort states"/btshared->scantuplesortstates to determine how much of a share of maintenance_work_mem each worker tuplesort gets. Our planner code needs to take that into account, now that the nbtsort.c parallel_leader_participation behavior isn't just some obscure debug option. IOW, the planner code needs to be consistent with the nbtsort.c execution code.

-- Peter Geoghegan
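Concretely, the adjustment being asked for amounts to something like this sketch of the quoted loop, counting the leader as a memory-consuming participant only when it will actually participate:

    sort_mem_blocks = (maintenance_work_mem * 1024L) / BLCKSZ;
    min_sort_mem_blocks = (32768L * 1024L) / BLCKSZ;
    while (parallel_workers > min_parallel_workers &&
           sort_mem_blocks / (parallel_workers +
                              (parallel_leader_participation ? 1 : 0)) <
           min_sort_mem_blocks)
        parallel_workers--;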
On Fri, Dec 8, 2017 at 1:57 PM, Peter Geoghegan <pg@bowt.ie> wrote: > 1. Thomas' barrier abstraction was added by commit 1145acc7. I think > that you should use a static barrier in tuplesort.c now, and rip out > the ConditionVariable fields in the Sharedsort struct. It's only a > slightly higher level of abstraction for tuplesort.c, which makes only > a small difference given the simple requirements of tuplesort.c. > However, I see no reason to not go that way if that's the new > standard, which it is. This looks like it will be fairly easy. I thought about this too. A static barrier seems ideal for it, except for one tiny detail. We'd initialise the barrier with the number of participants, and then after launching we get to find out how many workers were really launched using pcxt->nworkers_launched, which may be a smaller number. If it's a smaller number, we need to adjust the barrier to the smaller party size. We can't do that by calling BarrierDetach() n times, because Andres convinced me to assert that you didn't try to detach from a static barrier (entirely reasonably) and I don't really want a process to be 'detaching' on behalf of someone else anyway. So I think we'd need to add an extra barrier function that lets you change the party size of a static barrier. Yeah, that sounds like a contradiction... but it's not the same as the attach/detach workflow because static parties *start out attached*, which is a very important distinction (it means that client code doesn't have to futz about with phases, or in other words the worker doesn't have to consider the possibility that it started up late and missed all the action and the sort is finished). The tidiest way to provide this new API would, I think, be to change the internal function BarrierDetachImpl() to take a parameter n and reduce barrier->participants by that number, and then add a function BarrierForgetParticipants(barrier, n) [insert better name] and have it call BarrierDetachImpl(). Then the latter's assertion that !static_party could move out to BarrierDetach() and BarrierArriveAndDetach(). Alternatively, we could use the dynamic API (see earlier parentheses about phases). The end goal would be that code like this can use BarrierInit(&barrier, participants), then (if necessary) BarrierForgetParticipants(&barrier, nonstarters), and then they all just have to call BarrierArriveAndWait() at the right time and that's all. Nice and tidy. -- Thomas Munro http://www.enterprisedb.com
On Fri, Dec 8, 2017 at 2:23 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > On Fri, Dec 8, 2017 at 1:57 PM, Peter Geoghegan <pg@bowt.ie> wrote: >> 1. Thomas' barrier abstraction was added by commit 1145acc7. I think >> that you should use a static barrier in tuplesort.c now, and rip out >> the ConditionVariable fields in the Sharedsort struct. > > ... So I think we'd need to add an extra barrier > function that lets you change the party size of a static barrier. Something like the attached (untested), which would allow _bt_begin_parallel() to call BarrierInit(&barrier, request + 1), then BarrierForgetParticipants(&barrier, request - pcxt->nworkers_launched), and then all the condition variable loop stuff can be replaced with a well placed call to BarrierArriveAndWait(&barrier, WAIT_EVENT_SOMETHING_SOMETHING). -- Thomas Munro http://www.enterprisedb.com
Attachment
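As a simplified picture of what the proposed BarrierForgetParticipants() would do, consider this standalone toy model. Everything here other than the proposed semantics -- reducing the party size of a static barrier before anyone arrives -- is a stand-in invented for illustration, not barrier.c code:

#include <assert.h>
#include <stdio.h>
#include <stdbool.h>

typedef struct ToyBarrier
{
    int  participants;    /* current party size */
    bool static_party;    /* initialized with a fixed party? */
} ToyBarrier;

static void
ToyBarrierInit(ToyBarrier *barrier, int participants)
{
    barrier->participants = participants;
    barrier->static_party = true;
}

/*
 * Reduce the party size to account for workers that were requested but
 * never launched.  Unlike detaching, this is fine for a static party,
 * because the non-starters never attached in the first place.
 */
static void
ToyBarrierForgetParticipants(ToyBarrier *barrier, int n)
{
    assert(barrier->static_party);
    assert(n >= 0 && n < barrier->participants);
    barrier->participants -= n;
}

int
main(void)
{
    ToyBarrier barrier;
    int        request = 7;            /* workers requested */
    int        nworkers_launched = 5;  /* workers actually launched */

    /* the leader counts as a participant, hence request + 1 */
    ToyBarrierInit(&barrier, request + 1);
    ToyBarrierForgetParticipants(&barrier, request - nworkers_launched);
    printf("party size: %d\n", barrier.participants);   /* prints 6 */
    return 0;
}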
Thanks for review.
On Fri, Dec 8, 2017 at 6:27 AM, Peter Geoghegan <pg@bowt.ie> wrote:
On Thu, Dec 7, 2017 at 12:25 AM, Rushabh Lathia
<rushabh.lathia@gmail.com> wrote:
> 0001-Add-parallel-B-tree-index-build-sorting_v14.patch
Cool. I'm glad that we now have a patch that applies cleanly against
master, while adding very little to buffile.c. It feels like we're
getting very close here.
>> * You didn't point out the randomAccess restriction in tuplesort.h.
>>
>
> I did, it's there in the file header comments.
I see what you wrote in tuplesort.h here:
> + * algorithm, and are typically only used for large amounts of data. Note
> + * that parallel sorts is not support for random access to the sort result.
This should say "...are not supported when random access is requested".
Done.
> Added handling for parallel_leader_participation as well as deleted
> compile time option force_single_worker.
I still see this:
> +
> +/*
> + * A parallel sort with one worker process, and without any leader-as-worker
> + * state may be used for testing the parallel tuplesort infrastructure.
> + */
> +#ifdef NOT_USED
> +#define FORCE_SINGLE_WORKER
> +#endif
Looks like you missed this FORCE_SINGLE_WORKER hunk -- please remove it, too.
Done.
>> The parallel_leader_participation docs will also need to be updated.
>>
>
> Done.
I don't see this. There is no reference to
parallel_leader_participation in the CREATE INDEX docs, nor is there a
reference to CREATE INDEX in the parallel_leader_participation docs.
I thought parallel_leader_participation is generic GUC which get effect
for all parallel operation. isn't it? On that understanding I just update the
documentation of parallel_leader_participation into config.sgml to
make it more generalize.
> Also performed more testing with the patch, with
> parallel_leader_participation
> ON and OFF. Found one issue, where earlier we always used to call
> _bt_leader_sort_as_worker() but now need to skip the call if
> parallel_leader_participation
> is OFF.
Hmm. I think the local variable within _bt_heapscan() should go back.
Its value should be directly taken from parallel_leader_participation
assignment, once. There might be some bizarre circumstances where it
is possible for the value of parallel_leader_participation to change
in flight, causing a race condition: we start with the leader as a
participant, and change our mind later within
_bt_leader_sort_as_worker(), causing the whole CREATE INDEX to hang
forever.
Even if that's impossible, it seems like an improvement in style to go
back to one local variable controlling everything.
Yes, to me also it's looks kind of impossible situation but then too
it make sense to make one local variable and then always read the
value from that.
Style issue here:
> + long start_block = file->numFiles * BUFFILE_SEG_SIZE;
> + int newNumFiles = file->numFiles + source->numFiles;
Shouldn't start_block conform to the surrounding camelCase style?
Done.
Finally, two new thoughts on the patch that are not responses to
anything you did in v14:
1. Thomas' barrier abstraction was added by commit 1145acc7. I think
that you should use a static barrier in tuplesort.c now, and rip out
the ConditionVariable fields in the Sharedsort struct. It's only a
slightly higher level of abstraction for tuplesort.c, which makes only
a small difference given the simple requirements of tuplesort.c.
However, I see no reason to not go that way if that's the new
standard, which it is. This looks like it will be fairly easy.
Pending, as per Thomas' explanation, it seems like some more work is
needed in the barrier APIs.
2. Does the plan_create_index_workers() cost model need to account for
parallel_leader_participation, too, when capping workers? I think that
it does.
The relevant planner code is:
> + /*
> + * Cap workers based on available maintenance_work_mem as needed.
> + *
> + * Note that each tuplesort participant receives an even share of the
> + * total maintenance_work_mem budget. Aim to leave workers (where
> + * leader-as-worker Tuplesortstate counts as a worker) with no less than
> + * 32MB of memory. This leaves cases where maintenance_work_mem is set to
> + * 64MB immediately past the threshold of being capable of launching a
> + * single parallel worker to sort.
> + */
> + sort_mem_blocks = (maintenance_work_mem * 1024L) / BLCKSZ;
> + min_sort_mem_blocks = (32768L * 1024L) / BLCKSZ;
> + while (parallel_workers > min_parallel_workers &&
> + sort_mem_blocks / (parallel_workers + 1) < min_sort_mem_blocks)
> + parallel_workers--;
This parallel CREATE INDEX planner code snippet is about the need to
have low per-worker maintenance_work_mem availability prevent more
parallel workers from being added to the number that we plan to
launch. Each worker tuplesort state needs at least 32MB. We clearly
need to do something here.
While it's always true that "leader-as-worker Tuplesortstate counts as
a worker" in v14, I think that it should only be true in the next
revision of the patch when parallel_leader_participation is actually
true (IOW, we should only add 1 to parallel_workers within the loop
invariant in that case). The reason why we need to consider
parallel_leader_participation within this plan_create_index_workers()
code is simple: During execution, _bt_leader_sort_as_worker() uses
"worker tuplesort states"/btshared->scantuplesortstates to determine
how much of a share of maintenance_work_mem each worker tuplesort
gets. Our planner code needs to take that into account, now that the
nbtsort.c parallel_leader_participation behavior isn't just some
obscure debug option. IOW, the planner code needs to be consistent
with the nbtsort.c execution code.
Ah nice catch. I passed the local variable (leaderasworker) of _bt_heapscan()
to plan_create_index_workers() rather than direct reading value from the
parallel_leader_participation (reasons are same as you explained earlier).
Thanks,
Rushabh Lathia
Attachment
Hello Rushabh, On Fri, December 8, 2017 2:28 am, Rushabh Lathia wrote: > Thanks for review. > > On Fri, Dec 8, 2017 at 6:27 AM, Peter Geoghegan <pg@bowt.ie> wrote: > >> On Thu, Dec 7, 2017 at 12:25 AM, Rushabh Lathia >> <rushabh.lathia@gmail.com> wrote: >> > 0001-Add-parallel-B-tree-index-build-sorting_v14.patch I've looked only at patch 0002, here are some comments. > + * leaderasworker indicates whether leader going to participate as worker or > + * not. The grammar is a bit off, and the "or not" seems obvious. IMHO this could be: + * leaderasworker indicates whether the leader is going to participate as worker The argument leaderasworker is only used once and for one temp. variable that is only used once, too. So the temp. variable could maybe go. And not sure what the verdict was from the const-discussion threads, I did not follow it through. If "const" is what should be done generally, then the argument could be consted, as to not create more "unconsted" code. E.g. so: +plan_create_index_workers(Oid tableOid, Oid indexOid, const bool leaderasworker) and later: - sort_mem_blocks / (parallel_workers + 1) < min_sort_mem_blocks) + sort_mem_blocks / (parallel_workers + (leaderasworker ? 1 : 0)) < min_sort_mem_blocks) Thank you for working on this patch! All the best, Tels
On Thu, Dec 7, 2017 at 11:28 PM, Rushabh Lathia <rushabh.lathia@gmail.com> wrote: > I thought parallel_leader_participation is generic GUC which get effect > for all parallel operation. isn't it? On that understanding I just update > the > documentation of parallel_leader_participation into config.sgml to > make it more generalize. Okay. I'm not quite sure how to fit parallel_leader_participation into parallel CREATE INDEX (see my remarks on that below). I see a new bug in the patch (my own bug). Which is: the CONCURRENTLY case should obtain a RowExclusiveLock on the index relation within _bt_worker_main(), not an AccessExclusiveLock. That's all the leader has at that point within CREATE INDEX CONCURRENTLY. I now believe that index_create() should reject catalog parallel CREATE INDEX directly, just as it does for catalog CREATE INDEX CONCURRENTLY. That logic should be generic to all AMs, since the reasons for disallowing catalog parallel index builds are generic. On a similar note, *maybe* we should even call plan_create_index_workers() from index_create() (or at least some point within index.c). You're going to need a new field or two within IndexInfo for this, beside ii_Concurrent/ii_BrokenHotChain (next to the other stuff that is only used during index builds). Maybe ii_ParallelWorkers, and ii_LeaderAsWorker. What do you think of this suggestion? It's probably neater overall...though I'm less confident that this one is an improvement. Note that cluster.c calls plan_cluster_use_sort() directly, while checking "OldIndex->rd_rel->relam == BTREE_AM_OID" as a prerequisite to calling it. This seems like it might be considered an example that we should follow within index.c -- plan_create_index_workers() is based on plan_cluster_use_sort(). > Yes, to me also it's looks kind of impossible situation but then too > it make sense to make one local variable and then always read the > value from that. I think that it probably is technically possible, though the user would have to be doing something insane for it to be a problem. As I'm sure you understand, it's simpler to eliminate the possibility than it is to reason about it never happening. >> 1. Thomas' barrier abstraction was added by commit 1145acc7. I think >> that you should use a static barrier in tuplesort.c now, and rip out >> the ConditionVariable fields in the Sharedsort struct. > Pending, as per Thomas' explanation, it seems like need some more > work in the barrier APIs. Okay. It's not the case that parallel tuplesort would significantly benefit from using the barrier abstraction, so I don't think we need to consider this a blocker to commit. My concern is mostly just that everyone is on the same page with barriers. > Ah nice catch. I passed the local variable (leaderasworker) of > _bt_heapscan() > to plan_create_index_workers() rather than direct reading value from the > parallel_leader_participation (reasons are same as you explained earlier). Cool. I don't think that this should be a separate patch -- please rebase + squash. Do you think that the main part of the cost model needs to care about parallel_leader_participation, too? compute_parallel_worker() assumes that the caller is planning a parallel-sequential-scan-alike thing, in the sense that the leader only acts like a worker in cases that probably don't have many workers, where the leader cannot keep itself busy as a leader. 
That's actually quite different to parallel CREATE INDEX, because the leader-as-worker state will behave in exactly the same way as a worker would, no matter how many workers there are. The leader process is guaranteed to give its full attention to being a worker, because it has precisely nothing else to do until workers finish. This makes me think that we may need to immediately do something with the result of compute_parallel_worker(), to consider whether or not a leader-as-worker state should be used, despite the fact that no existing compute_parallel_worker() caller does anything like this. -- Peter Geoghegan
Thanks Tels for reviewing the patch.
On Fri, Dec 8, 2017 at 2:54 PM, Tels <nospam-pg-abuse@bloodgate.com> wrote:
Hello Rushabh,
On Fri, December 8, 2017 2:28 am, Rushabh Lathia wrote:
> Thanks for review.
>
> On Fri, Dec 8, 2017 at 6:27 AM, Peter Geoghegan <pg@bowt.ie> wrote:
>
>> On Thu, Dec 7, 2017 at 12:25 AM, Rushabh Lathia
>> <rushabh.lathia@gmail.com> wrote:
>> > 0001-Add-parallel-B-tree-index-build-sorting_v14.patch
I've looked only at patch 0002, here are some comments.
> + * leaderasworker indicates whether leader going to participate as
worker or
> + * not.
The grammar is a bit off, and the "or not" seems obvious. IMHO this could be:
+ * leaderasworker indicates whether the leader is going to participate as
worker
Sure.
The argument leaderasworker is only used once and for one temp. variable
that is only used once, too. So the temp. variable could maybe go.
And not sure what the verdict was from the const-discussion threads, I did
not follow it through. If "const" is what should be done generally, then
the argument could be consted, as to not create more "unconsted" code.
E.g. so:
+plan_create_index_workers(Oid tableOid, Oid indexOid, const bool
leaderasworker)
Makes sense.
and later:
- sort_mem_blocks / (parallel_workers + 1) <
min_sort_mem_blocks)
+ sort_mem_blocks / (parallel_workers + (leaderasworker
? 1 : 0)) < min_sort_mem_blocks)
Even I didn't like taking an extra variable, but then the code looks a bit
unreadable - so rather than making the code difficult to read - I thought
of adding a new variable.
Thank you for working on this patch!
I will address review comments in the next set of patches.
Regards,
Rushabh Lathia
On Sun, Dec 10, 2017 at 3:06 AM, Peter Geoghegan <pg@bowt.ie> wrote:
On Thu, Dec 7, 2017 at 11:28 PM, Rushabh Lathia
<rushabh.lathia@gmail.com> wrote:
> I thought parallel_leader_participation is generic GUC which get effect
> for all parallel operation. isn't it? On that understanding I just update
> the
> documentation of parallel_leader_participation into config.sgml to
> make it more generalize.
Okay. I'm not quite sure how to fit parallel_leader_participation into
parallel CREATE INDEX (see my remarks on that below).
I see a new bug in the patch (my own bug). Which is: the CONCURRENTLY
case should obtain a RowExclusiveLock on the index relation within
_bt_worker_main(), not an AccessExclusiveLock. That's all the leader
has at that point within CREATE INDEX CONCURRENTLY.
Oh right. I also missed testing that earlier. Fixed now.
I now believe that index_create() should reject catalog parallel
CREATE INDEX directly, just as it does for catalog CREATE INDEX
CONCURRENTLY. That logic should be generic to all AMs, since the
reasons for disallowing catalog parallel index builds are generic.
Sorry I didn't get this, reject means? you mean it should throw an error
catalog parallel CREATE INDEX? or just suggesting to set the
ParallelWorkers and may be LeaderAsWorker from index_create()
or may be index_build()?
On a similar note, *maybe* we should even call
plan_create_index_workers() from index_create() (or at least some
point within index.c). You're going to need a new field or two within
IndexInfo for this, beside ii_Concurrent/ii_BrokenHotChain (next to
the other stuff that is only used during index builds). Maybe
ii_ParallelWorkers, and ii_LeaderAsWorker. What do you think of this
suggestion? It's probably neater overall...though I'm less confident
that this one is an improvement.
Note that cluster.c calls plan_cluster_use_sort() directly, while
checking "OldIndex->rd_rel->relam == BTREE_AM_OID" as a prerequisite
to calling it. This seems like it might be considered an example that
we should follow within index.c -- plan_create_index_workers() is
based on plan_cluster_use_sort().
> Yes, to me also it's looks kind of impossible situation but then too
> it make sense to make one local variable and then always read the
> value from that.
I think that it probably is technically possible, though the user
would have to be doing something insane for it to be a problem. As I'm
sure you understand, it's simpler to eliminate the possibility than it
is to reason about it never happening.
yes.
>> 1. Thomas' barrier abstraction was added by commit 1145acc7. I think
>> that you should use a static barrier in tuplesort.c now, and rip out
>> the ConditionVariable fields in the Sharedsort struct.
> Pending, as per Thomas' explanation, it seems like some more work is
> needed in the barrier APIs.
Okay. It's not the case that parallel tuplesort would significantly
benefit from using the barrier abstraction, so I don't think we need
to consider this a blocker to commit. My concern is mostly just that
everyone is on the same page with barriers.
True, if needed, this can also be done later on.
> Ah nice catch. I passed the local variable (leaderasworker) of
> _bt_heapscan()
> to plan_create_index_workers() rather than direct reading value from the
> parallel_leader_participation (reasons are same as you explained earlier).
Cool. I don't think that this should be a separate patch -- please
rebase + squash.
Sure, done.
Do you think that the main part of the cost model needs to care about
parallel_leader_participation, too?
compute_parallel_worker() assumes that the caller is planning a
parallel-sequential-scan-alike thing, in the sense that the leader
only acts like a worker in cases that probably don't have many
workers, where the leader cannot keep itself busy as a leader. That's
actually quite different to parallel CREATE INDEX, because the
leader-as-worker state will behave in exactly the same way as a worker
would, no matter how many workers there are. The leader process is
guaranteed to give its full attention to being a worker, because it
has precisely nothing else to do until workers finish. This makes me
think that we may need to immediately do something with the result of
compute_parallel_worker(), to consider whether or not a
leader-as-worker state should be used, despite the fact that no
existing compute_parallel_worker() caller does anything like this.
I agree with you. compute_parallel_worker() is mainly designed for the
scan-alike things. Whereas parallel create index is different in a
sense where leader has as much power as worker. But at the same
time I don't see any side effect or negative of that with PARALLEL
CREATE INDEX. So I am more towards not changing that, at least
for now - as part of this patch.
Thanks for review.
Regards, Rushabh Lathia
Attachment
On Tue, Dec 12, 2017 at 2:09 AM, Rushabh Lathia <rushabh.lathia@gmail.com> wrote: >> I now believe that index_create() should reject catalog parallel >> CREATE INDEX directly, just as it does for catalog CREATE INDEX >> CONCURRENTLY. That logic should be generic to all AMs, since the >> reasons for disallowing catalog parallel index builds are generic. >> > > Sorry I didn't get this, reject means? you mean it should throw an error > catalog parallel CREATE INDEX? or just suggesting to set the > ParallelWorkers and may be LeaderAsWorker from index_create() > or may be index_build()? I mean that we should be careful to make sure that AM-generic parallel CREATE INDEX logic does not end up in a specific AM (nbtree). The patch *already* refuses to perform a parallel CREATE INDEX on a system catalog, which is what I meant by reject (sorry for being unclear). The point is that that's due to a restriction that has nothing to do with nbtree in particular (just like the CIC restriction on catalogs), so it should be performed within index_build(). Just like the similar CONCURRENTLY-on-a-catalog restriction, though without throwing an error, since of course the user doesn't explicitly ask for a parallel CREATE INDEX at any point (unlike CONCURRENTLY). Once we go this way, the cost model has to be called at that point, too. We already have the AM-specific "OldIndex->rd_rel->relam == BTREE_AM_OID" tests within cluster.c, even though theoretically another AM might be involved with CLUSTER in the future, which this seems similar to. So, I propose the following (this is a rough outline): * Add new IndexInfo fields after ii_Concurrent/ii_BrokenHotChain -- ii_ParallelWorkers and ii_LeaderAsWorker. * Call plan_create_index_workers() within index_create(), assigning to ii_ParallelWorkers, and fill in ii_LeaderAsWorker from the parallel_leader_participation GUC. Add comments along the lines of "only nbtree supports parallel builds". Test the index with a "heapRelation->rd_rel->relam == BTREE_AM_OID" to make this work. Otherwise, assign zero to ii_ParallelWorkers (and leave ii_LeaderAsWorker as false). * For builds on catalogs, or builds using other AMs, don't let parallelism go ahead by immediately assigning zero to ii_ParallelWorkers within index_create(), near where the similar CIC test occurs already. What do you think of that? >> Do you think that the main part of the cost model needs to care about >> parallel_leader_participation, too? >> >> compute_parallel_worker() assumes that the caller is planning a >> parallel-sequential-scan-alike thing, in the sense that the leader >> only acts like a worker in cases that probably don't have many >> workers, where the leader cannot keep itself busy as a leader. That's >> actually quite different to parallel CREATE INDEX, because the >> leader-as-worker state will behave in exactly the same way as a worker >> would, no matter how many workers there are. The leader process is >> guaranteed to give its full attention to being a worker, because it >> has precisely nothing else to do until workers finish. This makes me >> think that we may need to immediately do something with the result of >> compute_parallel_worker(), to consider whether or not a >> leader-as-worker state should be used, despite the fact that no >> existing compute_parallel_worker() caller does anything like this. >> > > I agree with you. compute_parallel_worker() is mainly designed for the > scan-alike things.
Whereas parallel create index is different in a > sense where leader has as much power as worker. But at the same > time I don't see any side effect or negative of that with PARALLEL > CREATE INDEX. So I am more towards not changing that, at least > for now - as part of this patch. I've also noticed that there is little to no negative effect on CREATE INDEX duration from adding new workers past the point where adding more workers stops making the build faster. It's quite clear. And, in general, there isn't all that much theoretical justification for the cost model (it's essentially the same as any other parallel scan), which doesn't seem to matter much. So, I agree that it doesn't really matter in practice, but disagree that it should not still be changed -- the justification may be a little thin, but I think that we need to stick to it. There should be a theoretical justification for the cost model that is coherent in the wider context of cost models for parallelism in general. It should not be arbitrarily inconsistent just because it apparently doesn't matter that much. It's easy to fix -- let's just fix it. -- Peter Geoghegan
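As a standalone sketch, that outline might look roughly like the following. All of the types and helpers here are stubs invented for illustration; only the proposed field names and the decision logic are taken from the outline above:

#include <stdio.h>
#include <stdbool.h>

typedef unsigned int Oid;

#define BTREE_AM_OID 403    /* nbtree's actual pg_am OID */

typedef struct IndexInfo
{
    bool ii_Concurrent;
    bool ii_BrokenHotChain;
    int  ii_ParallelWorkers;    /* proposed new field */
    bool ii_LeaderAsWorker;     /* proposed new field */
} IndexInfo;

/* stubs standing in for the real planner call and GUC */
static bool parallel_leader_participation = true;

static int
plan_create_index_workers(Oid tableOid, Oid indexOid)
{
    return 4;    /* pretend the cost model chose 4 workers */
}

static void
set_index_build_parallelism(IndexInfo *ii, Oid tableOid, Oid indexOid,
                            Oid relam, bool is_catalog)
{
    if (is_catalog || relam != BTREE_AM_OID)
    {
        /* only nbtree supports parallel builds; catalogs never do */
        ii->ii_ParallelWorkers = 0;
        ii->ii_LeaderAsWorker = false;
    }
    else
    {
        ii->ii_ParallelWorkers = plan_create_index_workers(tableOid, indexOid);
        ii->ii_LeaderAsWorker = parallel_leader_participation;
    }
}

int
main(void)
{
    IndexInfo ii = {false, false, 0, false};

    set_index_build_parallelism(&ii, 16384, 16385, BTREE_AM_OID, false);
    printf("workers=%d leader-as-worker=%d\n",
           ii.ii_ParallelWorkers, (int) ii.ii_LeaderAsWorker);
    return 0;
}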
On Sun, Dec 31, 2017 at 9:59 PM, Peter Geoghegan <pg@bowt.ie> wrote:
On Tue, Dec 12, 2017 at 2:09 AM, Rushabh Lathia
<rushabh.lathia@gmail.com> wrote:
>> I now believe that index_create() should reject catalog parallel
>> CREATE INDEX directly, just as it does for catalog CREATE INDEX
>> CONCURRENTLY. That logic should be generic to all AMs, since the
>> reasons for disallowing catalog parallel index builds are generic.
>>
>
> Sorry I didn't get this, reject means? you mean it should throw an error
> catalog parallel CREATE INDEX? or just suggesting to set the
> ParallelWorkers and may be LeaderAsWorker from index_create()
> or may be index_build()?
I mean that we should be careful to make sure that AM-generic parallel
CREATE INDEX logic does not end up in a specific AM (nbtree).
Ah okay, that's what I thought.
The patch *already* refuses to perform a parallel CREATE INDEX on a
system catalog, which is what I meant by reject (sorry for being
unclear). The point is that that's due to a restriction that has
nothing to do with nbtree in particular (just like the CIC restriction
on catalogs), so it should be performed within index_build(). Just
like the similar CONCURRENTLY-on-a-catalog restriction, though without
throwing an error, since of course the user doesn't explicitly ask for
a parallel CREATE INDEX at any point (unlike CONCURRENTLY).
Once we go this way, the cost model has to be called at that point,
too. We already have the AM-specific "OldIndex->rd_rel->relam ==
BTREE_AM_OID" tests within cluster.c, even though theoretically
another AM might be involved with CLUSTER in the future, which this
seems similar to.
So, I propose the following (this is a rough outline):
* Add new IndexInfo fields after ii_Concurrent/ii_BrokenHotChain --
ii_ParallelWorkers and ii_LeaderAsWorker.
* Call plan_create_index_workers() within index_create(), assigning to
ii_ParallelWorkers, and fill in ii_LeaderAsWorker from the
parallel_leader_participation GUC. Add comments along the lines of
"only nbtree supports parallel builds". Test the index with a
"heapRelation->rd_rel->relam == BTREE_AM_OID" to make this work.
Otherwise, assign zero to ii_ParallelWorkers (and leave
ii_LeaderAsWorker as false).
* For builds on catalogs, or builds using other AMs, don't let
parallelism go ahead by immediately assigning zero to
ii_ParallelWorkers within index_create(), near where the similar CIC
test occurs already.
What do you think of that?
Need to do this after the indexRelation build. So I added it after the
update of pg_index, as indexRelation is needed for plan_create_index_workers().
Attaching the separate patch for the same.
>> Do you think that the main part of the cost model needs to care about
>> parallel_leader_participation, too?
>>
>> compute_parallel_worker() assumes that the caller is planning a
>> parallel-sequential-scan-alike thing, in the sense that the leader
>> only acts like a worker in cases that probably don't have many
>> workers, where the leader cannot keep itself busy as a leader. That's
>> actually quite different to parallel CREATE INDEX, because the
>> leader-as-worker state will behave in exactly the same way as a worker
>> would, no matter how many workers there are. The leader process is
>> guaranteed to give its full attention to being a worker, because it
>> has precisely nothing else to do until workers finish. This makes me
>> think that we may need to immediately do something with the result of
>> compute_parallel_worker(), to consider whether or not a
>> leader-as-worker state should be used, despite the fact that no
>> existing compute_parallel_worker() caller does anything like this.
>>
>
> I agree with you. compute_parallel_worker() is mainly designed for the
> scan-alike things. Whereas parallel create index is different in a
> sense where leader has as much power as worker. But at the same
> time I don't see any side effect or negative of that with PARALLEL
> CREATE INDEX. So I am more towards not changing that, at least
> for now - as part of this patch.
I've also noticed that there is little to no negative effect on
CREATE INDEX duration from adding new workers past the point where
adding more workers stops making the build faster. It's quite clear.
And, in general, there isn't all that much theoretical justification
for the cost model (it's essentially the same as any other parallel
scan), which doesn't seem to matter much. So, I agree that it doesn't
really matter in practice, but disagree that it should not still be
changed -- the justification may be a little thin, but I think that we
need to stick to it. There should be a theoretical justification for
the cost model that is coherent in the wider context of cost models
for parallelism in general. It should not be arbitrarily inconsistent
just because it apparently doesn't matter that much. It's easy to fix
-- let's just fix it.
So you are suggesting that we need to adjust the output of
compute_parallel_worker() by considering parallel_leader_participation?
Thanks,
Rushabh Lathia
Attachment
On Tue, Jan 2, 2018 at 1:38 AM, Rushabh Lathia <rushabh.lathia@gmail.com> wrote: > Need to do this after the indexRelation build. So I added it after the > update of pg_index, as indexRelation is needed for > plan_create_index_workers(). > > Attaching the separate patch for the same. This made it so that REINDEX and CREATE INDEX CONCURRENTLY no longer used parallelism. I think we need to do this very late, just before nbtree's ambuild() routine is called from index.c. > So you are suggesting that we need to adjust the output of > compute_parallel_worker() by considering parallel_leader_participation? We know for sure that there is no reason to not use the leader process as a worker process in the case of parallel CREATE INDEX. So we must not have the number of participants (i.e. worker Tuplesortstates) vary based on the current parallel_leader_participation setting. While parallel_leader_participation can affect the number of worker processes requested, that's a different thing. There is no question about parallel_leader_participation ever being relevant to performance -- it's strictly a testing option for us. Even after parallel_leader_participation was added, compute_parallel_worker() still assumes that the sequential scan leader is always too busy to help. compute_parallel_worker() seems to think that that's something that the leader does in "rare" cases not worth considering -- cases where it has no worker tuples to consume (maybe I'm reading too much into it not caring about parallel_leader_participation, but I don't think so). If compute_parallel_worker()'s assumption was questionable before, it's completely wrong for parallel CREATE INDEX. I think plan_create_index_workers() needs to count the leader-as-worker as an ordinary worker, not special in any way by deducting one worker from what compute_parallel_worker() returns. (This only happens when it's necessary to compensate -- when leader-as-worker participation is going to go ahead.) I'm working on fixing up what you posted. I'm probably not more than a week away from posting a patch that I'm going to mark "ready for committer". I've already made the change above, and once I spend time on trying to break the few small changes needed within buffile.c I'll have taken it as far as I can, most likely. -- Peter Geoghegan
On Wed, Jan 3, 2018 at 9:11 AM, Peter Geoghegan <pg@bowt.ie> wrote:
On Tue, Jan 2, 2018 at 1:38 AM, Rushabh Lathia <rushabh.lathia@gmail.com> wrote:
> Need to do this after the indexRelation build. So I added it after the
> update of pg_index, as indexRelation is needed for
> plan_create_index_workers().
>
> Attaching the separate patch for the same.
This made it so that REINDEX and CREATE INDEX CONCURRENTLY no longer
used parallelism. I think we need to do this very late, just before
nbtree's ambuild() routine is called from index.c.
Ahh right. We should move the plan_create_index_workers() call to
index_build() before the ambuild().
> So you are suggesting that we need to adjust the output of
> compute_parallel_worker() by considering parallel_leader_participation?
We know for sure that there is no reason to not use the leader process
as a worker process in the case of parallel CREATE INDEX. So we must
not have the number of participants (i.e. worker Tuplesortstates) vary
based on the current parallel_leader_participation setting. While
parallel_leader_participation can affect the number of worker
processes requested, that's a different thing. There is no question
about parallel_leader_participation ever being relevant to performance
-- it's strictly a testing option for us.
Even after parallel_leader_participation was added,
compute_parallel_worker() still assumes that the sequential scan
leader is always too busy to help. compute_parallel_worker() seems to
think that that's something that the leader does in "rare" cases not
worth considering -- cases where it has no worker tuples to consume
(maybe I'm reading too much into it not caring about
parallel_leader_participation, but I don't think so). If
compute_parallel_worker()'s assumption was questionable before, it's
completely wrong for parallel CREATE INDEX. I think
plan_create_index_workers() needs to count the leader-as-worker as an
ordinary worker, not special in any way by deducting one worker from
what compute_parallel_worker() returns. (This only happens when it's
necessary to compensate -- when leader-as-worker participation is
going to go ahead.)
Yes, even with parallel_leader_participation - compute_parallel_worker()
doesn't take that into consideration. Or maybe the assumption is to
launch the number of workers returned by compute_parallel_worker(),
irrespective of whether the leader is going to participate in a scan or not.
I agree that plan_create_index_workers() needs to count the leader as a
normal worker for the CREATE INDEX. So what you proposing is - when
parallel_leader_participation is true launch (return value of compute_parallel_worker() - 1)
workers. true ?
I'm working on fixing up what you posted. I'm probably not more than a
week away from posting a patch that I'm going to mark "ready for
committer". I've already made the change above, and once I spend time
on trying to break the few small changes needed within buffile.c I'll
have taken it as far as I can, most likely.
Okay, once you submit the patch with changes - I will do one round of
review for the changes.
Thanks,
Rushabh Lathia
On Tue, Jan 2, 2018 at 8:43 PM, Rushabh Lathia <rushabh.lathia@gmail.com> wrote: > I agree that plan_create_index_workers() needs to count the leader as a > normal worker for the CREATE INDEX. So what you proposing is - when > parallel_leader_participation is true launch (return value of > compute_parallel_worker() - 1) > workers. true ? Almost. We need to not subtract one when only one worker is indicated by compute_parallel_worker(). I also added some new stuff there, to consider edge cases with the parallel_leader_participation GUC. >> I'm working on fixing up what you posted. I'm probably not more than a >> week away from posting a patch that I'm going to mark "ready for >> committer". I've already made the change above, and once I spend time >> on trying to break the few small changes needed within buffile.c I'll >> have taken it as far as I can, most likely. >> > > Okay, once you submit the patch with changes - I will do one round of > review for the changes. I've attached my revision. Changes include: * Changes to plan_create_index_workers() were made along the lines recently discussed. * plan_create_index_workers() is now called right before the ambuild routine is called (nbtree index builds only, of course). * Significant overhaul of tuplesort.h contract. This had references to the old approach, and to tqueue.c's tuple descriptor thing that was since superseded by the typmod registry added for parallel hash join. These were updated/removed. * Both tuplesort.c and logtape.c now say that they cannot write to the writable/last tape, while still acknowledging that it is in fact the leader tape, and that this restriction is due to a restriction with BufFiles. They also point out that if the restriction within buffile.c ever was removed, everything would work fine. * Added new call to BufFileExportShared() when freezing tape in logtape.c. * Tweaks to documentation. * pgindent ran on modified files. * Polished the stuff that is added to buffile.c. Mostly comments that clarify its reason for existing. Also added Assert()s. Note that I added Heikki as an author in the commit message. Technically, Heikki didn't actually write code for parallel CREATE INDEX, but he did loads of independently useful work on merging + temp file I/O that went into Postgres 10 (though this wasn't listed in the v10 release notes). That work was done in large part to help the parallel CREATE INDEX patch, and it did in fact help it quite noticeably, so I think that this is warranted. Remember that with parallel CREATE INDEX, the leader's merge occurs serially, so anything that we can do to speed that part up is very helpful. This revision does seem very close, but I'll hold off on changing the status of the patch for a few more days, to give you time to give some feedback. -- Peter Geoghegan
Attachment
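A minimal standalone sketch of the adjustment being described (the function is an illustrative stand-in, not the patch's actual plan_create_index_workers() logic):

#include <stdio.h>
#include <stdbool.h>

static int
adjust_for_leader_participation(int compute_result, bool leader_participates)
{
    /*
     * With a participating leader, N - 1 worker processes plus the leader
     * still give us N sorting participants; but never go below one actual
     * worker, or no parallel workers would be launched at all.
     */
    if (leader_participates && compute_result > 1)
        return compute_result - 1;
    return compute_result;
}

int
main(void)
{
    printf("%d\n", adjust_for_leader_participation(4, true));    /* 3 */
    printf("%d\n", adjust_for_leader_participation(1, true));    /* 1 */
    printf("%d\n", adjust_for_leader_participation(4, false));   /* 4 */
    return 0;
}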
On Sat, Jan 6, 2018 at 3:47 AM, Peter Geoghegan <pg@bowt.ie> wrote:
Thanks,
On Tue, Jan 2, 2018 at 8:43 PM, Rushabh Lathia <rushabh.lathia@gmail.com> wrote:
> I agree that plan_create_index_workers() needs to count the leader as a
> normal worker for the CREATE INDEX. So what you proposing is - when
> parallel_leader_participation is true launch (return value of
> compute_parallel_worker() - 1)
> workers. true ?
Almost. We need to not subtract one when only one worker is indicated
by compute_parallel_worker(). I also added some new stuff there, to
consider edge cases with the parallel_leader_participation GUC.
>> I'm working on fixing up what you posted. I'm probably not more than a
>> week away from posting a patch that I'm going to mark "ready for
>> committer". I've already made the change above, and once I spend time
>> on trying to break the few small changes needed within buffile.c I'll
>> have taken it as far as I can, most likely.
>>
>
> Okay, once you submit the patch with changes - I will do one round of
> review for the changes.
I've attached my revision. Changes include:
* Changes to plan_create_index_workers() were made along the lines
recently discussed.
* plan_create_index_workers() is now called right before the ambuild
routine is called (nbtree index builds only, of course).
* Significant overhaul of tuplesort.h contract. This had references to
the old approach, and to tqueue.c's tuple descriptor thing that was
since superseded by the typmod registry added for parallel hash join.
These were updated/removed.
* Both tuplesort.c and logtape.c now say that they cannot write to the
writable/last tape, while still acknowledging that it is in fact the
leader tape, and that this restriction is due to a restriction with
BufFiles. They also point out that if the restriction within buffile.c
ever was removed, everything would work fine.
* Added new call to BufFileExportShared() when freezing tape in logtape.c.
* Tweaks to documentation.
* pgindent ran on modified files.
* Polished the stuff that is added to buffile.c. Mostly comments that
clarify its reason for existing. Also added Assert()s.
Note that I added Heikki as an author in the commit message.
Technically, Heikki didn't actually write code for parallel CREATE
INDEX, but he did loads of independently useful work on merging + temp
file I/O that went into Postgres 10 (though this wasn't listed in the
v10 release notes). That work was done in large part to help the
parallel CREATE INDEX patch, and it did in fact help it quite
noticeably, so I think that this is warranted. Remember that with
parallel CREATE INDEX, the leader's merge occurs serially, so anything
that we can do to speed that part up is very helpful.
This revision does seem very close, but I'll hold off on changing the
status of the patch for a few more days, to give you time to give some
feedback.
Thanks Peter for the updated patch.
I went through the changes and performed basic testing. The changes
look good and I haven't found anything unusual during testing.
Rushabh Lathia
On Mon, Jan 8, 2018 at 9:44 PM, Rushabh Lathia <rushabh.lathia@gmail.com> wrote: > I went through the changes and performed basic testing. The changes > look good and I haven't found anything unusual during testing. Then I'll mark the patch "Ready for Committer" now. I think that we've done just about all we can with it. There is one lingering concern that I cannot shake, which stems from the fact that the cost model (plan_create_index_workers()) follows the same generic logic for adding workers as parallel sequential scan, per Robert's feedback from around March of last year (that is, we more or less just reuse compute_parallel_worker()). My specific concern is that this approach may be too aggressive in situations where a parallel external sort ends up being used instead of a serial internal sort. No weight is given to any extra temp file costs; a serial external sort is, in a sense, the baseline, including in cases where the table is very small and an external sort can actually easily be avoided iff we do a serial sort. This is probably not worth doing anything about. The distinction between internal and external sorts became rather blurred in 9.6 and 10, which, in a way, this patch builds on. If what I describe is a problem at all, it will very probably only be a problem on small CREATE INDEX operations, where linear sequential I/O costs are not already dwarfed by the linearithmic CPU costs. (The dominance of CPU/comparison costs on larger sorts is the main reason why external sorts can be faster than internal sorts -- this happens fairly frequently these days, especially with CREATE INDEX, where being able to write out the index as it merges on-the-fly helps a lot.) -- Peter Geoghegan
On Sat, Jan 6, 2018 at 11:17 AM, Peter Geoghegan <pg@bowt.ie> wrote: > * Significant overhaul of tuplesort.h contract. This had references to > the old approach, and to tqueue.c's tuple descriptor thing that was > since superseded by the typmod registry added for parallel hash join. > These were updated/removed. +1 > * Both tuplesort.c and logtape.c now say that they cannot write to the > writable/last tape, while still acknowledging that it is in fact the > leader tape, and that this restriction is due to a restriction with > BufFiles. They also point out that if the restriction within buffile.c > ever was removed, everything would work fine. +1 > * Added new call to BufFileExportShared() when freezing tape in logtape.c. +1 > * Polished the stuff that is added to buffile.c. Mostly comments that > clarify its reason for existing. Also added Assert()s. +1 This looks good to me. -- Thomas Munro http://www.enterprisedb.com
On Tue, Jan 9, 2018 at 10:36 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > This looks good to me. The addition to README.parallel is basically wrong, because workers have been allowed to write WAL since the ParallelContext machinery. See the XactLastRecEnd handling in parallel.c. Workers can, for example, do HOT cleanups during SELECT scans, just as the leader can. The language here is obsolete anyway in light of commit e9baa5e9fa147e00a2466ab2c40eb99c8a700824, but this isn't the right way to update it. I'll propose a separate patch for that. The change to the ParallelContext signature in parallel.h makes an already-overlength line even longer. A line break seems warranted just after the first argument, plus pgindent afterward. I am not a fan of the leader-as-worker terminology. The leader is not a worker, full stop. I think we should instead talk about whether the leader participates (so, ii_LeaderAsWorker -> ii_LeaderParticipates, for example, plus many comment updates). Similarly, it seems SortCoordinateData's nLaunched should be nParticipants, and BTLeader's nworkertuplesorts should be nparticipanttuplesorts. There is also the question of whether we want to respect parallel_leader_participation in this context. The issues which might motivate the desire for such behavior in the context of a query do not exist when creating a btree index, so maybe we're just making this complicated. On the other hand, if some other type of parallel index build does end up doing a Gather-like operation then we might regret deciding that parallel_leader_participation doesn't apply to index builds, so maybe it's OK the way we have it. On the third hand, the complexity of having the leader maybe-participate seems like it extends to a fair number of places in the code, and getting rid of all that complexity seems appealing. One place where this actually causes a problem is the message changes to index_build(). The revised ereport() violates translatability guidelines, which require that messages not be assembled from pieces. See https://www.postgresql.org/docs/devel/static/nls-programmer.html#NLS-GUIDELINES A comment added to tuplesort.h says that work_mem should be at least 64KB, but does not give any reason. I think one should be given, at least briefly, so that someone looking at these comments in the future can, for example, figure out whether the comment is still correct after future code changes. Or else, remove the comment. + * Parallel sort callers are required to coordinate multiple tuplesort states + * in a leader process, and one or more worker processes. The leader process I think the comma should be removed. As written, it looks like we are coordinating multiple tuplesort states in a leader process, and, separately, we are coordinating one or more worker processes. But in fact we are coordinating multiple tuplesort states which are in a group of processes that includes the leader and one or more worker processes. Generally, I think the comments in tuplesort.h are excellent. I really like the overview of how the new interfaces should be used, although I find it slightly wonky that the leader needs two separate Tuplesortstates if it wants to participate. I don't understand why this patch needs to tinker with the tests in vacuum.sql. The comments say that "If we did not do this, errors raised would concern running ANALYZE in parallel mode." However, why should parallel CREATE INDEX have any impact on ANALYZE at all?
Also, as a practical matter, if I revert those changes, 'make check' still passes with or without force_parallel_mode=on. I really dislike the fact that this patch invents another thing for force_parallel_mode to do. I invented force_parallel_mode mostly as a way of testing that functions were correctly labeled for parallel-safety, and I think it would be just fine if it never does anything else. As it is, it already does two quite separate things to accomplish that goal: (1) forcibly run the whole plan with parallel mode restrictions enabled, provided that the plan is not parallel-unsafe, and (2) runs the plan in a worker, provided that the plan is parallel-safe. There's a subtle difference between those two conditions, which is that not parallel-unsafe does not equal parallel-safe; there is also parallel-restricted. The fact that force_parallel_mode controls two different behaviors has, I think, already caused some confusion for prominent PostgreSQL developers and, likely, users as well. Making it do a third thing seems to me to be adding to the confusion, and not only because there are no documentation changes to match. If we go down this road, there will probably be more additions -- what happens when parallel VACUUM arrives, or parallel CLUSTER, or whatever? I don't think it will be a good thing for PostgreSQL if we end up with force_parallel_mode=on as a general "use parallelism even though it's stupid" flag, requiring supporting code in many different places throughout the code base and a laundry list of not-actually-useful behavior changes in the documentation. What I think would be a lot more useful, and what I sort of expected the patch to have, is a way for a user to explicitly control the number of workers requested for a CREATE INDEX operation. We all know that the cost model is crude and that may be OK -- though it would be interesting to see some research on what the run times actually look like for various numbers of workers at various table sizes and work_mem settings -- but it will be inconvenient for DBAs who actually know what number of workers they want to use to instead get whatever value plan_create_index_workers() decides to emit. They can force it by setting the parallel_workers reloption, but that affects queries. They can probably also do it by setting min_parallel_table_scan_size = 0 and max_parallel_workers_maintenance to whatever value they want, but I think it would be convenient for there to be a more straightforward way to do it, or at least some documentation in the CREATE INDEX page about how to get the number of workers you really want. To be clear, I don't think that this is a must-fix issue for this patch to get committed, but I do think that all reference to force_parallel_mode=on should go away. I do not like the way that this patch wants to turn the section of the documentation on when parallel query can be used into a discussion of when parallelism can be used. I think it would be better to leave that section alone and instead document under CREATE INDEX the concerns specific to parallel index build. I think this will be easier for users to understand and far easier to maintain as the number of parallel DDL operations increases, which I expect it to do somewhat explosively. The patch as written says things like "If a utility statement that is expected to do so does not produce a parallel plan, ..."
but, one, utility statements *do not produce plans of any type* and, two, the concerns here are really specific to parallel CREATE INDEX and there is every reason to think that they might be different in other cases. I feel strongly that it's enough for this section to try to explain the concerns that pertain to optimizable queries and leave utility commands to be treated elsewhere. If we find that we're accumulating a lot of documentation for various parallel utility commands that seems to be duplicative, we can write a general treatment of that topic that is separate from this one. The documentation for max_parallel_workers_maintenance cribs from the documentation for max_parallel_workers_per_gather in saying that we'll use fewer workers than expected "which may be inefficient". However, for parallel CREATE INDEX, that trailing clause is, at least as far as I can see, not applicable. For a query, we might choose a Gather over a Parallel Seq Scan because we think we've got a lot of workers; with only one participant, we might prefer a GIN index scan. If it turns out we don't get the workers, we've got a clearly suboptimal plan. For CREATE INDEX, though, it seems to me that we don't make any decisions based on the number of workers we think we'll have. If we get fewer workers, it may be slower, but it should still be as fast as it can be with that number of workers, which for queries is not the case. + * These fields are not modified throughout the sort. They primarily + * exist for the benefit of worker processes, that need to create BTSpool + * state corresponding to that used by the leader. throughout -> during remove comma + * builds, that must work just the same when an index is built in remove comma + * State that is aggregated by workers, to report back to leader. State that is maintained by workers and reported back to leader. + * indtuples is the total number of tuples that made it into index. into the index + * btleader is only present when a parallel index build is performed, and + * only in leader process (actually, only the leader has a BTBuildState. + * Workers have their own spool and spool2, though.) the leader process period after "process" capitalize actually + * Done. Leave a way for leader to determine we're finished. Record how + * many tuples were in this worker's share of the relation. I don't understand what the "Leave a way" comment means. + * To support parallel sort operations involving coordinated callers to + * tuplesort.c routines across multiple workers, it is necessary to + * concatenate each worker BufFile/tapeset into one single leader-wise + * logical tapeset. Workers should have produced one final materialized + * tape (their entire output) when this happens in leader; there will always + * be the same number of runs as input tapes, and the same number of input + * tapes as workers. I can't interpret the word "leader-wise". A partition-wise join is a join done one partition at a time, but a leader-wise logical tape set is not done one leader at a time. If there's another meaning to the affix -wise, I'm not familiar with it. Don't we just mean "a single logical tapeset managed by the leader"? There's a lot here I haven't grokked yet, but I'm running out of mental energy so I think I'll send this for now and work on this some more when time permits, hopefully tomorrow. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
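To illustrate the translatability point with a standalone example: the messages below and the local _() macro are invented for illustration (in real backend code _() marks strings for gettext-style lookup), and are not the patch's actual wording:

#include <stdio.h>
#include <stdbool.h>

#define _(x) (x)    /* stand-in for translatable-message markup */

static void
report_build(bool parallel, int nworkers)
{
    /* WRONG: assembles a sentence from pieces; no whole unit to translate */
    printf("building index%s\n", parallel ? _(" in parallel") : "");

    /* RIGHT: each complete message is its own translatable string */
    if (parallel)
        printf(_("building index with %d parallel workers\n"), nworkers);
    else
        printf(_("building index serially\n"));
}

int
main(void)
{
    report_build(true, 2);
    report_build(false, 0);
    return 0;
}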
> On Jan 10, 2018, at 21:45, Robert Haas <robertmhaas@gmail.com> wrote: > > The documentation for max_parallel_workers_maintenance cribs from the > documentation for max_parallel_workers_per_gather in saying that we'll > use fewer workers than expected "which may be inefficient". Can we actually call it max_parallel_maintenance_workers instead? I mean we don't have work_mem_maintenance.
On Wed, Jan 10, 2018 at 3:29 PM, Evgeniy Shishkin <itparanoia@gmail.com> wrote: >> On Jan 10, 2018, at 21:45, Robert Haas <robertmhaas@gmail.com> wrote: >> The documentation for max_parallel_workers_maintenance cribs from the >> documentation for max_parallel_workers_per_gather in saying that we'll >> use fewer workers than expected "which may be inefficient". > > Can we actually call it max_parallel_maintenance_workers instead? > I mean we don't have work_mem_maintenance. Good point. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Jan 10, 2018 at 10:45 AM, Robert Haas <robertmhaas@gmail.com> wrote: > The addition to README.parallel is basically wrong, because workers > have been allowed to write WAL since the ParallelContext machinery. > See the > XactLastRecEnd handling in parallel.c. Workers can, for example, do > HOT cleanups during SELECT scans, just as the leader can. The > language here is obsolete anyway in light of commit > e9baa5e9fa147e00a2466ab2c40eb99c8a700824, but this isn't the right way > to update it. I'll propose a separate patch for that. WFM. > The change to the ParallelContext signature in parallel.h makes an > already-overlength line even longer. A line break seems warranted > just after the first argument, plus pgindent afterward. Okay. > I am not a fan of the leader-as-worker terminology. The leader is not > a worker, full stop. I think we should instead talk about whether the > leader participates (so, ii_LeaderAsWorker -> ii_LeaderParticipates, > for example, plus many comment updates). Similarly, it seems > SortCoordinateData's nLaunched should be nParticipants, and BTLeader's > nworkertuplesorts should be nparticipanttuplesorts. Okay. > There is also the question of whether we want to respect > parallel_leader_participation in this context. The issues which might > motivate the desire for such behavior in the context of a query do not > exist when creating a btree index, so maybe we're just making this > complicated. On the other hand, if some other type of parallel index > build does end up doing a Gather-like operation then we might regret > deciding that parallel_leader_participation doesn't apply to index > builds, so maybe it's OK the way we have it. On the third hand, the > complexity of having the leader maybe-participate seems like it > extends to a fair number of places in the code, and getting rid of all > that complexity seems appealing. I only added support for the leader-as-worker case because I assumed that it was important to have CREATE INDEX process allocation work "analogously" to parallel query, even though it's clear that the two situations are not really completely comparable when you dig deep enough. Getting rid of the leader participating as a worker has theoretical downsides, but real practical upsides. I am also tempted to just get rid of it. > One place where this actually causes a problem is the message changes > to index_build(). The revised ereport() violates translatability > guidelines, which require that messages not be assembled from pieces. > See https://www.postgresql.org/docs/devel/static/nls-programmer.html#NLS-GUIDELINES Noted. Another place where a worker Tuplesortstate in the leader process causes problems is plan_create_index_workers(), especially because of things like force_parallel_mode and parallel_leader_participation. > A comment added to tuplesort.h says that work_mem should be at least > 64KB, but does not give any reason. I think one should be given, at > least briefly, so that someone looking at these comments in the future > can, for example, figure out whether the comment is still correct > after future code changes. Or else, remove the comment. The reason for needing to do this is that a naive division of work_mem/maintenance_work_mem within a caller like nbtsort.c could, in general, result in a workMem that is as low as 0 (due to integer truncation of the result of a division). Clearly *that* is too low.
In fact, we need at least enough memory to store the initial minimal memtuples array, which needs to respect ALLOCSET_SEPARATE_THRESHOLD. There is also the matter of having per-tape space for TAPE_BUFFER_OVERHEAD when we spill to disk (note also the special case for pass-by-value datum sorts low on memory). There have been a couple of unavoidable OOM bugs in tuplesort over the years already. How about I remove the comment, but have tuplesort_begin_common() force each Tuplesortstate to have workMem that is at least 64KB (the minimum legal work_mem value) in all cases? We can just formalize the existing assumption that workMem cannot go below 64KB, really, and it isn't reasonable to use so little workMem within a parallel worker (it should be prevented by plan_create_index_workers() in the real world, where parallelism is never artificially forced). There is no need to make this complicated by worrying about whether or not 64KB is the true minimum (the value that avoids "can't happen" errors), IMV. > + * Parallel sort callers are required to coordinate multiple tuplesort states > + * in a leader process, and one or more worker processes. The leader process > > I think the comma should be removed. As written, it looks like we > are coordinating multiple tuplesort states in a leader process, and, > separately, we are coordinating one or more worker processes. Okay. > Generally, I think the comments in tuplesort.h are excellent. Thanks. > I really like the overview of how the new interfaces should be used, > although I find it slightly wonky that the leader needs two separate > Tuplesortstates if it wants to participate. Assuming that we end up actually allowing the leader to participate as a worker at all, I think that having that be a separate Tuplesortstate is better than the alternative. There are a couple of places where I can see it mattering. For one thing, dtrace-compatible traces become more complicated -- LogicalTapeSetBlocks() is reported to dtrace within workers (though not via trace_sort logging, where it is considered redundant). For another, I think we'd need to have multiple tapesets at the same time for the leader if it only had one Tuplesortstate, which means multiple new Tuplesortstate fields. In short, having a distinct Tuplesortstate means almost no special cases. Maybe you find it slightly wonky because parallel CREATE INDEX really does have the leader participate as a worker with minimal caveats. It will do just as much work as a real parallel worker process, which really is quite a new thing, in a way. > I don't understand why this patch needs to tinker with the tests in > vacuum.sql. The comments say that "If we did not do this, errors > raised would concern running ANALYZE in parallel mode." However, why > should parallel CREATE INDEX have any impact on ANALYZE at all? > Also, as a practical matter, if I revert those changes, 'make check' > still passes with or without force_parallel_mode=on. This certainly wasn't true before now -- parallel CREATE INDEX could previously cause the test to give different output for one error message. I'll revert that change. I imagine (though haven't verified) that this happened because, as you pointed out separately, I didn't get the memo about e9baa5e9 (this is the commit you mentioned in relation to README.parallel/parallel write DML). > I really dislike the fact that this patch invents another thing for > force_parallel_mode to do.
I invented force_parallel_mode mostly as a > way of testing that functions were correctly labeled for > parallel-safety, and I think it would be just fine if it never does > anything else. This is not something that I feel strongly about, though I think it is useful to test parallel CREATE INDEX in low memory conditions, one way or another. > I don't think it will be a > good thing for PostgreSQL if we end up with force_parallel_mode=on as > a general "use parallelism even though it's stupid" flag, requiring > supporting code in many different places throughout the code base and > a laundry list of not-actually-useful behavior changes in the > documentation. I will admit that "use parallelism even though it's stupid" is how I thought of force_parallel_mode=on. I thought of it as a testing option that users shouldn't need to concern themselves with in almost all cases. I am not at all attached to what I did with force_parallel_mode, except that it provides some way to test low memory conditions, and it was something that I thought you'd expect from this patch. > What I think would be a lot more useful, and what I sort of expected > the patch to have, is a way for a user to explicitly control the > number of workers requested for a CREATE INDEX operation. I tend to agree. It wouldn't be *very* compelling, because there doesn't seem to be much subtlety to how many workers get used anyway, but it's worth having. > We all know > that the cost model is crude and that may be OK -- though it would be > interesting to see some research on what the run times actually look > like for various numbers of workers at various table sizes and > work_mem settings -- but it will be inconvenient for DBAs who actually > know what number of workers they want to use to instead get whatever > value plan_create_index_workers() decides to emit. I did a lot of unpublished research on this over a year ago, and noticed nothing strange then. I guess I could revisit it on the box that Postgres Pro provided me access to. > They can force it > by setting the parallel_workers reloption, but that affects queries. > They can probably also do it by setting min_parallel_table_scan_size = > 0 and max_parallel_workers_maintenance to whatever value they want, > but I think it would be convenient for there to be a more > straightforward way to do it, or at least some documentation in the > CREATE INDEX page about how to get the number of workers you really > want. To be clear, I don't think that this is a must-fix issue for > this patch to get committed, but I do think that all references to > force_parallel_mode=on should go away. The only reason I didn't add a "just use this many parallel workers" option myself already is that doing so introduces awkward ambiguities. Long ago, there was a parallel_workers index storage param added by the patch, which you didn't like because it confused the issue in just the same way as the table parallel_workers storage param does now, would have confused parallel index scan, and so on. I counter-argued that though this was ugly, it seemed to be how it worked on other systems (more of an explanation than an argument, actually, because I find it hard to know what to do here). You're right that there should be a way to simply force the number of parallel workers for DDL commands that use parallelism.
You're also right that this shouldn't be a storage parameter (index or otherwise), because a storage parameter modifies run-time behavior in a surprising way (even if this pitfall *is* actually something that users of SQL Server and Oracle have to live with). Adding something to the CREATE INDEX grammar just for this *also* seems confusing, because users will think that it is a storage parameter even though it isn't (I'm pretty sure that almost no Postgres user can give you a definition of a storage parameter without some prompting). I share your general feelings on all of this, but I really don't know what to do about it. Which of these alternatives is the least worst, all things considered? > I do not like the way that this patch wants to turn the section of the > documentation on when parallel query can be used into a discussion of > when parallelism can be used. I think it would be better to leave > that section alone and instead document under CREATE INDEX the > concerns specific to parallel index build. I think this will be easier > for users to understand and far easier to maintain as the number of > parallel DDL operations increases, which I expect it to do somewhat > explosively. WFM. > The documentation for max_parallel_workers_maintenance cribs from the > documentation for max_parallel_workers_per_gather in saying that we'll > use fewer workers than expected "which may be inefficient". However, > for parallel CREATE INDEX, that trailing clause is, at least as far as > I can see, not applicable. Fair point. Will revise. > (Various points on phrasing and punctuation) That all seems fine. > + * To support parallel sort operations involving coordinated callers to > + * tuplesort.c routines across multiple workers, it is necessary to > + * concatenate each worker BufFile/tapeset into one single leader-wise > + * logical tapeset. Workers should have produced one final materialized > + * tape (their entire output) when this happens in leader; there will always > + * be the same number of runs as input tapes, and the same number of input > + * tapes as workers. > > I can't interpret the word "leader-wise". A partition-wise join is a > join done one partition at a time, but a leader-wise logical tape set > is not done one leader at a time. If there's another meaning to the > affix -wise, I'm not familiar with it. Don't we just mean "a single > logical tapeset managed by the leader"? Yes, we do. Will change. > There's a lot here I haven't grokked yet, but I'm running out of > mental energy so I think I'll send this for now and work on this some > more when time permits, hopefully tomorrow. The good news is that the things that you took issue with were about what I expected you to take issue with. You seem to be getting through the review of this patch very efficiently. -- Peter Geoghegan
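PS -- to make the 64KB business above concrete, the hazard and the clamp I have in mind look roughly like this (a sketch only; nparticipants stands in for whatever variable name we actually end up with):

/* In a caller like nbtsort.c, each participant gets an even share of
 * the budget (both values are in KB): */
sortmem = maintenance_work_mem / nparticipants;	/* can truncate to 0! */

/* In tuplesort_begin_common(), formalize the 64KB floor (the minimum
 * legal work_mem value), instead of documenting it in tuplesort.h: */
state->allowedMem = Max(workMem, 64) * (int64) 1024;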
On Wed, Jan 10, 2018 at 1:31 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> Can we actually call it max_parallel_maintenance_workers instead? >> I mean we don't have work_mem_maintenance. > > Good point. WFM. -- Peter Geoghegan
On Wed, Jan 10, 2018 at 5:05 PM, Peter Geoghegan <pg@bowt.ie> wrote: > How about I remove the comment, but have tuplesort_begin_common() > force each Tuplesortstate to have workMem that is at least 64KB > (the minimum legal work_mem value) in all cases? We can just formalize the > existing assumption that workMem cannot go below 64KB, really, and it > isn't reasonable to use so little workMem within a parallel worker (it > should be prevented by plan_create_index_workers() in the real world, > where parallelism is never artificially forced). +1. I think this doesn't even need to be documented. You can simply write a comment that says something like: /* Always allow each worker to use at least 64kB. If the amount of memory allowed for the sort is very small, this might technically cause us to exceed it, but since it's tiny compared to the overall memory cost of running a worker in the first place, it shouldn't matter. */ > I share your general feelings on all of this, but I really don't know > what to do about it. Which of these alternatives is the least worst, > all things considered? Let's get the patch committed without any explicit way of forcing the number of workers and then think about adding that later. It will be good if you and Rushabh can agree on who will produce the next version of this patch, and also if I have some idea when that version should be expected. On another point, we will need to agree on how this should be credited in an eventual commit message. I do not agree with adding Heikki as an author unless he contributed code, but we can credit him in some other way, like "Thanks are also due to Heikki Linnakangas for significant improvements to X, Y, and Z that made this patch possible." I assume the author credit will be "Peter Geoghegan, Rushabh Lathia" in that order, but let me know if anyone thinks that isn't the right idea. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas wrote: > + * To support parallel sort operations involving coordinated callers to > + * tuplesort.c routines across multiple workers, it is necessary to > + * concatenate each worker BufFile/tapeset into one single leader-wise > + * logical tapeset. Workers should have produced one final materialized > + * tape (their entire output) when this happens in leader; there will always > + * be the same number of runs as input tapes, and the same number of input > + * tapes as workers. > > I can't interpret the word "leader-wise". A partition-wise join is a > join done one partition at a time, but a leader-wise logical tape set > is not done one leader at a time. If there's another meaning to the > affix -wise, I'm not familiar with it. Don't we just mean "a single > logical tapeset managed by the leader"? https://www.merriam-webster.com/dictionary/-wise -wise (adverb combining form): 1a: in the manner of (crabwise, fanwise); 1b: in the position or direction of (slantwise, clockwise); 2: with regard to, in respect of (dollarwise). I think "one at a time" is not the right way to interpret the affix. Rather, a "partitionwise join" is a join done "in the manner of partitions", that is, the characteristics of the partitions are considered when the join is done. I'm not defending the "leader-wise" term here, though, because I can't make sense of it, regardless of how I interpret the -wise affix. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Jan 10, 2018 at 2:21 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> I share your general feelings on all of this, but I really don't know >> what to do about it. Which of these alternatives is the least worst, >> all things considered? > > Let's get the patch committed without any explicit way of forcing the > number of workers and then think about adding that later. It could be argued that you need some way of forcing low memory in workers with any committed version. So while this sounds reasonable, it might not be compatible with throwing out what I've done with force_parallel_mode up-front, before you commit anything. What do you think? > It will be good if you and Rushabh can agree on who will produce the > next version of this patch, and also if I have some idea when that > version should be expected. I'll take it. > On another point, we will need to agree > on how this should be credited in an eventual commit message. I do > not agree with adding Heikki as an author unless he contributed code, > but we can credit him in some other way, like "Thanks are also due to > Heikki Linnakangas for significant improvements to X, Y, and Z that > made this patch possible." I agree that I should have been more nuanced with this. Here's what I intended: Heikki is not the author of any of the code in the final commit, but he is morally a (secondary) author of the feature as a whole, and should be credited as such within the final release notes. This is justified by the history here, which is that he was involved with the patch fairly early on, and did some work that was particularly important to the feature, that almost certainly would not otherwise have happened. Sure, it helped the serial case too, but much less so. That's really not why he did it. > I assume the author credit will be "Peter > Geoghegan, Rushabh Lathia" in that order, but let me know if anyone > thinks that isn't the right idea. "Peter Geoghegan, Rushabh Lathia" seems right. Thomas did write a very small amount of the actual code, but I think it was more of a review thing (he is already credited as a reviewer). -- Peter Geoghegan
On Wed, Jan 10, 2018 at 2:36 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > I think "one at a time" is not the right way to interpret the affix. > Rather, a "partitionwise join" is a join done "in the manner of > partitions", that is, the characteristics of the partitions are > considered when the join is done. > > I'm not defending the "leader-wise" term here, though, because I can't > make sense of it, regardless of how I interpret the -wise affix. I've already conceded the point, but fwiw "leader-wise" comes from the idea of having a leader-wise space that results from concatenating the worker tapes (each of which has its own original/worker-wise space). We must apply an offset to get from a worker-wise offset to a leader-wise offset. This made more sense in an earlier version. I overlooked this during recent self-review. -- Peter Geoghegan
On Thu, Jan 11, 2018 at 11:42 AM, Peter Geoghegan <pg@bowt.ie> wrote: > "Peter Geoghegan, Rushabh Lathia" seems right. Thomas did write a very > small amount of the actual code, but I think it was more of a review > thing (he is already credited as a reviewer). +1 -- Thomas Munro http://www.enterprisedb.com
On Thu, Jan 11, 2018 at 3:35 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Wed, Jan 10, 2018 at 1:31 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> Can we actually call it max_parallel_maintenance_workers instead?
>>> I mean we don't have work_mem_maintenance.
>>
>> Good point.
>
> WFM.
This is a good point. I agree with max_parallel_maintenance_workers.
--
Rushabh Lathia
On Wed, Jan 10, 2018 at 5:42 PM, Peter Geoghegan <pg@bowt.ie> wrote: > On Wed, Jan 10, 2018 at 2:21 PM, Robert Haas <robertmhaas@gmail.com> wrote: >>> I share your general feelings on all of this, but I really don't know >>> what to do about it. Which of these alternatives is the least worst, >>> all things considered? >> >> Let's get the patch committed without any explicit way of forcing the >> number of workers and then think about adding that later. > > It could be argued that you need some way of forcing low memory in > workers with any committed version. So while this sounds reasonable, > it might not be compatible with throwing out what I've done with > force_parallel_mode up-front, before you commit anything. What do you > think? I think the force_parallel_mode thing is too ugly to live. I'm not sure that forcing low memory in workers is a thing we need to have, but if we do, then we'll have to invent some other way to have it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Jan 11, 2018 at 11:51 AM, Robert Haas <robertmhaas@gmail.com> wrote: > I think the force_parallel_mode thing is too ugly to live. I'm not > sure that forcing low memory in workers is a thing we need to have, > but if we do, then we'll have to invent some other way to have it. It might make sense to have the "minimum memory per participant" value come from a GUC, rather than be hard-coded (currently to 32MB). I don't think that it's that compelling as a user-visible option, but it might make sense as a testing option that we might very well decide to kill before v11 is released (we might kill it when we come up with an acceptable interface for "just use this many workers" in a later commit, which I think we'll definitely end up doing anyway). By setting the minimum participant memory to 0, you can then rely on the parallel_workers table storage param to force the number of worker processes that we'll request. You can accomplish the same thing with "min_parallel_table_scan_size = 0", of course. What do you think of that idea? To be clear, I'm not actually arguing that we need any of this. My point is that insisting on a way to test low memory conditions from the first commit would be reasonable. I don't actually feel strongly either way, though, and am not doing any insisting myself. -- Peter Geoghegan
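PS -- to sketch the GUC idea (min_parallel_sort_mem is a hypothetical name; its value would be in KB, the unit maintenance_work_mem uses, with the current hard-coded 32MB corresponding to 32768):

/* In plan_create_index_workers() -- scale back the worker count until
 * each participant (workers + leader) gets at least the minimum: */
while (parallel_workers > 0 &&
       maintenance_work_mem / (parallel_workers + 1) < min_parallel_sort_mem)
    parallel_workers--;

Setting the hypothetical GUC to 0 would then let the parallel_workers storage param (or min_parallel_table_scan_size = 0) force however many workers you like.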
On Thu, Jan 11, 2018 at 12:06 PM, Peter Geoghegan <pg@bowt.ie> wrote: > It might make sense to have the "minimum memory per participant" value > come from a GUC, rather than be hard coded (it's currently hard-coded > to 32MB). > What do you think of that idea? A third option here is to specifically recognize that compute_parallel_worker() returned a value based on the table storage param max_workers, and for that reason alone no "insufficient memory per participant" decrementing/vetoing should take place. That is, when the max_workers param is set, perhaps it should be completely impossible for CREATE INDEX to ignore it for any reason other than an inability to launch parallel workers (though that could be due to the max_parallel_workers GUC's setting). You could argue that we should do this anyway, I suppose. -- Peter Geoghegan
On Thu, Jan 11, 2018 at 3:25 PM, Peter Geoghegan <pg@bowt.ie> wrote: > On Thu, Jan 11, 2018 at 12:06 PM, Peter Geoghegan <pg@bowt.ie> wrote: >> It might make sense to have the "minimum memory per participant" value >> come from a GUC, rather than be hard coded (it's currently hard-coded >> to 32MB). > >> What do you think of that idea? > > A third option here is to specifically recognize that > compute_parallel_worker() returned a value based on the table storage > param max_workers, and for that reason alone no "insufficient memory > per participant" decrementing/vetoing should take place. That is, when > the max_workers param is set, perhaps it should be completely > impossible for CREATE INDEX to ignore it for any reason other than an > inability to launch parallel workers (though that could be due to the > max_parallel_workers GUC's setting). > > You could argue that we should do this anyway, I suppose. Yes, I think this sounds like a good idea. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Jan 11, 2018 at 1:44 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> A third option here is to specifically recognize that >> compute_parallel_worker() returned a value based on the table storage >> param max_workers, and for that reason alone no "insufficient memory >> per participant" decrementing/vetoing should take place. That is, when >> the max_workers param is set, perhaps it should be completely >> impossible for CREATE INDEX to ignore it for any reason other than an >> inability to launch parallel workers (though that could be due to the >> max_parallel_workers GUC's setting). >> >> You could argue that we should do this anyway, I suppose. > > Yes, I think this sounds like a good idea. Cool. I've already implemented this in my local working copy of the patch. That settles that. If I'm not mistaken, the only outstanding question at this point is whether or not we're going to give in and remove parallel leader participation entirely. I suspect that we won't end up doing that, because while it's not very useful, it's also not hard to support. Besides, to some extent that's the expectation that has been established already. I am not far from posting a revision that incorporates all of your feedback. Expect that tomorrow afternoon your time at the latest. Of course, you may have more feedback for me in the meantime. Let me know if I should hold off on posting a new version. -- Peter Geoghegan
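PS -- the shape of what I just implemented locally is roughly as follows (a sketch, not the final code; it assumes a "done" label just past the memory-based scaling back):

/* In plan_create_index_workers() -- if the user set the parallel_workers
 * storage param, take it at face value, capped only by the GUC limit,
 * and skip the "insufficient memory per participant" veto entirely. */
if (rel->rel_parallel_workers != -1)
{
    parallel_workers = Min(rel->rel_parallel_workers,
                           max_parallel_workers_maintenance);
    goto done;
}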
On Wed, Jan 10, 2018 at 1:45 PM, Robert Haas <robertmhaas@gmail.com> wrote: > There's a lot here I haven't grokked yet, but I'm running out of > mental energy so I think I'll send this for now and work on this some > more when time permits, hopefully tomorrow. Looking at the logtape changes: While the patch contains, as I said before, an excellent set of how-to directions explaining how to use the new parallel sort facilities in tuplesort.c, there seems to be no such thing for logtape.c, and as a result I find it a bit unclear how the interface is supposed to work. I think it would be good to add a similar summary here. It seems like the words "leader" and "worker" here refer to the leader of a parallel operation and the associated workers, but do we really need to make that assumption? Couldn't we generally describe this as merging a bunch of 1-tape LogicalTapeSets created from a SharedFileSet into a single LogicalTapeSet that can thereafter be read by the process that does the merging? + /* Pass worker BufFile pieces, and a placeholder leader piece */ + for (i = 0; i < lts->nTapes; i++) + { + lt = &lts->tapes[i]; + + /* + * Build concatenated view of all BufFiles, remembering the block + * number where each source file begins. + */ + if (i < lts->nTapes - 1) Unless I'm missing something, the "if" condition just causes the last pass through this loop to do nothing. If so, why not just change the loop condition to i < lts->nTapes - 1 and drop the "if" statement altogether? + char filename[MAXPGPATH] = {0}; I don't think you need = {0}, because pg_itoa is about to clobber it anyway. + /* Alter worker's tape state (generic values okay for leader) */ What do you mean by generic values? + * Each tape is initialized in write state. Serial callers pass ntapes, but + * NULL arguments for everything else. Parallel worker callers pass a + * shared handle and worker number, but tapeset should be NULL. Leader + * passes worker -1, a shared handle, and shared tape metadata. These are + * used to claim ownership of worker tapes. This comment doesn't match the actual function definition terribly well. Serial callers don't pass NULL for "everything else", because "int worker" is not going to be NULL. For parallel workers, it's not entirely obvious whether "a shared handle" means TapeShare *tapes or SharedFileSet *fileset. "tapeset" sounds like an argument name, but there is no such argument. lt->max_size looks like it might be an optimization separate from the overall patch, but maybe I'm wrong about that. + /* palloc() larger than MaxAllocSize would fail */ lt->buffer = NULL; lt->buffer_size = 0; + lt->max_size = MaxAllocSize; The comment about palloc() should move down to where you assign max_size. Generally we avoid returning a struct type, so maybe LogicalTapeFreeze() should instead grow an out parameter of type TapeShare * which it populates only if not NULL. Won't LogicalTapeFreeze() fail an assertion in BufFileExportShared() if the file doesn't belong to a shared fileset? If you adopt the previous suggestion, we can probably just make whether to call this contingent on whether the TapeShare * out parameter is provided. I'm not confident I completely understand what's going on with the logtape stuff yet, so I might have more comments (or better ones) after I study this further. To your question about whether to go ahead and post a new version, I'm OK to keep reviewing this version for a little longer or to switch to a new one, as you prefer.
I have not made any local changes, just written a blizzard of email text. :-p -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Jan 11, 2018 at 2:26 PM, Robert Haas <robertmhaas@gmail.com> wrote: > While the patch contains, as I said before, an excellent set of how-to > directions explaining how to use the new parallel sort facilities in > tuplesort.c, there seems to be no such thing for logtape.c, and as a > result I find it a bit unclear how the interface is supposed to work. > I think it would be good to add a similar summary here. Okay. I came up with something for that. > It seems like the words "leader" and "worker" here refer to the leader > of a parallel operation and the associated workers, but do we really > need to make that assumption? Couldn't we generally describe this as > merging a bunch of 1-tape LogicalTapeSets created from a SharedFileSet > into a single LogicalTapeSet that can thereafter be read by the > process that does the merging? It's not so much an assumption as it is the most direct way of referring to these various objects. logtape.c is very clearly a submodule of tuplesort.c, so this felt okay to me. There are already several references to what tuplesort.c expects. I'm not going to argue about it if you insist on this, though I do think that trying to describe things in more general terms would be a net loss. It would kind of come off as feigning ignorance IMV. There is nothing that logtape.c could know less about other than names/roles, and I find it hard to imagine those changing, even when we add support for partitioning/distribution sort (where logtape.c handles "redistribution", something discussed early in this project's lifetime). > + /* Pass worker BufFile pieces, and a placeholder leader piece */ > + for (i = 0; i < lts->nTapes; i++) > + { > + lt = &lts->tapes[i]; > + > + /* > + * Build concatenated view of all BufFiles, remembering the block > + * number where each source file begins. > + */ > + if (i < lts->nTapes - 1) > > Unless I'm missing something, the "if" condition just causes the last > pass through this loop to do nothing. If so, why not just change the > loop condition to i < lts->nTapes - 1 and drop the "if" statement > altogether? The last "lt" in the loop is in fact used separately, just outside the loop. But that use turns out to have been subtly wrong, apparently due to a problem with converting logtape.c to use the shared buffile stuff. This buglet would only have caused writing to the leader tape to break (never trace_sort instrumentation), something that isn't supported anyway due to the restrictions that shared BufFiles have. But, we should, on general principle, be able to write to the leader tape if and when shared buffiles learn to support writing (after exporting the original BufFile in the worker). Buglet fixed in my local working copy. I did so in a way that changes the loop test along the lines you suggest. This should make the whole design of tape concatenation a bit clearer. > + char filename[MAXPGPATH] = {0}; > > I don't think you need = {0}, because pg_itoa is about to clobber it anyway. Okay. > + /* Alter worker's tape state (generic values okay for leader) */ > > What do you mean by generic values? I mean that the leader's tape doesn't need to have lt->firstBlockNumber set, because it's empty -- it can remain -1. Same applies to lt->offsetBlockNumber, too. I'll remove the text within parentheses, since it seems redundant given the structure of the loop. > + * Each tape is initialized in write state. Serial callers pass ntapes, but
Parallel worker callers pass a > + * shared handle and worker number, but tapeset should be NULL. Leader > + * passes worker -1, a shared handle, and shared tape metadata. These are > + * used to claim ownership of worker tapes. > > This comment doesn't match the actual function definition terribly > well. Serial callers don't pass NULL for "everything else", because > "int worker" is not going to be NULL. For parallel workers, it's not > entirely obvious whether "a shared handle" means TapeShare *tapes or > SharedFileSet *fileset. "tapeset" sounds like an argument name, but > there is no such argument. Okay. I've tweaked things here. > lt->max_size looks like it might be an optimization separate from the > overall patch, but maybe I'm wrong about that. I think that it's pretty much essential. Currently, the MaxAllocSize restriction is needed in logtape.c for the same reason that it's needed anywhere else. Not much to talk about there. The new max_size thing is about more than that, though -- it's really about not stupidly allocating up to a full MaxAllocSize when you already know that you're going to use next to no memory. You don't have this issue with serial sorts because serial sorts that only sort a tiny number of tuples never end up as external sorts -- when you end up doing a serial external sort, clearly you're never going to allocate an excessive amount of memory up front in logtape.c, because you are by definition operating in a memory-constrained fashion. Not so for parallel external tuplesorts. Think spool2 in a parallel unique index build, in the case where there are next to no recently dead tuples (the common case). > + /* palloc() larger than MaxAllocSize would fail */ > lt->buffer = NULL; > lt->buffer_size = 0; > + lt->max_size = MaxAllocSize; > > The comment about palloc() should move down to where you assign max_size. Okay. > Generally we avoid returning a struct type, so maybe > LogicalTapeFreeze() should instead grow an out parameter of type > TapeShare * which it populates only if not NULL. Okay. I've modified LogicalTapeFreeze(), adding a "share" output argument and reverting to returning void, as before. > Won't LogicalTapeFreeze() fail an assertion in BufFileExportShared() > if the file doesn't belong to a shared fileset? If you adopt the > previous suggestion, we can probably just make whether to call this > contingent on whether the TapeShare * out parameter is provided. Oops, you're right. It will be taken care of by the LogicalTapeFreeze() signature change you suggested. > I'm not confident I completely understand what's going on with the > logtape stuff yet, so I might have more comments (or better ones) > after I study this further. To your question about whether to go > ahead and post a new version, I'm OK to keep reviewing this version > for a little longer or to switch to a new one, as you prefer. I have > not made any local changes, just written a blizzard of email text. > :-p Great. Thanks. I've caught up with you again. I just need to take a fresh look at what I came up with, and maybe do some more testing. -- Peter Geoghegan
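PS -- with your suggestion, LogicalTapeFreeze() comes out looking roughly like this at the interface level (a sketch of my working copy; exact field names could still change):

void
LogicalTapeFreeze(LogicalTapeSet *lts, int tapenum, TapeShare *share)
{
	LogicalTape *lt = &lts->tapes[tapenum];

	/* ... existing freeze logic, unchanged ... */

	/*
	 * Only report the tape's metadata when the caller asks for it, i.e.
	 * in a parallel worker.  Serial callers pass NULL, so the
	 * BufFileExportShared() assertion about fileset-backed files can
	 * never trip for them.
	 */
	if (share != NULL)
	{
		BufFileExportShared(lts->pfile);
		share->firstblocknumber = lt->firstBlockNumber;
	}
}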
On Sat, Jan 6, 2018 at 3:47 AM, Peter Geoghegan <pg@bowt.ie> wrote: > On Tue, Jan 2, 2018 at 8:43 PM, Rushabh Lathia <rushabh.lathia@gmail.com> wrote: >> I agree that plan_create_index_workers() needs to count the leader as a >> normal worker for the CREATE INDEX. So what you are proposing is - when >> parallel_leader_participation is true launch (return value of >> compute_parallel_worker() - 1) >> workers. true ? > > Almost. We need to not subtract one when only one worker is indicated > by compute_parallel_worker(). I also added some new stuff there, to > consider edge cases with the parallel_leader_participation GUC. > >>> I'm working on fixing up what you posted. I'm probably not more than a >>> week away from posting a patch that I'm going to mark "ready for >>> committer". I've already made the change above, and once I spend time >>> on trying to break the few small changes needed within buffile.c I'll >>> have taken it as far as I can, most likely. >>> >> >> Okay, once you submit the patch with changes - I will do one round of >> review for the changes. > > I've attached my revision. Changes include: > A few observations from skimming through the patch: 1. + if (!IsBootstrapProcessingMode() && !indexInfo->ii_Concurrent) { - snapshot = RegisterSnapshot(GetTransactionSnapshot()); - OldestXmin = InvalidTransactionId; /* not used */ + OldestXmin = GetOldestXmin(heapRelation, true); I think leader and workers should have the same idea of oldestXmin for the purpose of deciding the visibility of tuples. I think this is ensured in all forms of parallel query, as we do share the snapshot; however, the same doesn't seem to be true for parallel index builds. 2. + + /* Wait on worker processes to finish (should be almost instant) */ + reltuples = _bt_leader_wait_for_workers(buildstate); Can't we use WaitForParallelWorkersToFinish for this purpose? The reason is that if we use a different mechanism here then we might need a different way to solve the problem related to fork failure. See thread [1]. Basically, if the postmaster fails to launch workers due to fork failure, the leader backend might wait indefinitely. [1] - https://commitfest.postgresql.org/16/1341/ -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Fri, Jan 12, 2018 at 8:19 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > 1. > + if (!IsBootstrapProcessingMode() && !indexInfo->ii_Concurrent) > { > - snapshot = RegisterSnapshot(GetTransactionSnapshot()); > - OldestXmin = InvalidTransactionId; /* not used */ > + OldestXmin = GetOldestXmin(heapRelation, true); > > I think leader and workers should have the same idea of oldestXmin for > the purpose of deciding the visibility of tuples. I think this is > ensured in all forms of parallel query, as we do share the snapshot; > however, the same doesn't seem to be true for parallel index builds. Hmm. Does it break anything if they use different snapshots? In the case of a query, that would be disastrous, because then you might get inconsistent results; but if the snapshot is only being used to determine what is and is not dead, then I'm not sure it makes much difference ... unless the different snapshots will create confusion of some other sort. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Jan 11, 2018 at 8:58 PM, Peter Geoghegan <pg@bowt.ie> wrote: > I've caught up with you again. I just need to take a look at what I > came up with with fresh eyes, and maybe do some more testing. More comments: BufFileView() looks fairly pointless. It basically creates a copy of the input and, in so doing, destroys the input, which is a lot like returning the input parameter except that it uses more cycles. It does do a few things. First, it zeroes the offsets array instead of copying the offsets. But as used, those offsets would have been 0 anyway. Second, it sets the fileset parameter to NULL. But that doesn't actually seem to be important for anything: the fileset is only used when creating new files, and the BufFile must already be marked read-only, so we won't be doing that. It seems like this function can just be entirely removed and replaced by Assert()-ing some things about the target in BufFileViewAppend, which I would just rename to BufFileAppend. In miscadmin.h, I'd put the prototype for the new GUC next to max_worker_processes, not maintenance_work_mem. The ereport() in index_build will, I think, confuse people when it says that there are 0 parallel workers. I suggest splitting this into two cases: if (indexInfo->ii_ParallelWorkers == 0) ereport(... "building index \"%s\" on table \"%s\" serially" ...) else ereport(... "building index \"%s\" on table \"%s\" in parallel with request for %d parallel workers" ...). Might even need three cases to handle parallel_leader_participation without needing to assemble the message, unless we drop parallel_leader_participation support. The logic in IndexBuildHeapRangeScan() around need_register_snapshot and OldestXmin seems convoluted and not very well-edited to me. For example, need_register_snapshot is set to false in a block that is only entered when it's already false, and the comment that follows is supposed to be associated with GetOldestXmin() and makes no sense here. I suggest that you go back to the original code organization and then just insert an additional case for a caller-supplied scan, so that the overall flow looks like this: if (scan != NULL) { ... } else if (IsBootstrapProcessingMode() || indexInfo->ii_Concurrent) { ... } else { ... } Along with that, I'd change the name of need_register_snapshot to need_unregister_snapshot (it's doing both jobs right now) and initialize it to false. If you enter the second of the above blocks then change it to true just after snapshot = RegisterSnapshot(GetTransactionSnapshot()). Then adjust the comment that begins "Prepare for scan of the base relation." by inserting an additional sentence just after that one: "If the caller has supplied a scan, just use it. Otherwise, in a normal index build..." and the rest as it is currently. + * This support code isn't reliable when called from within a parallel + * worker process due to the fact that our state isn't propagated. This is + * why parallel index builds are disallowed on catalogs. It is possible + * that we'll fail to catch an attempted use of a user index undergoing + * reindexing due the non-propagation of this state to workers, which is not + * ideal, but the problem is not particularly likely to go undetected due to + * our not doing better there. I understand the first two sentences, but I have no idea what the third one means, especially the part that says "not particularly likely to go undetected due to our not doing better there". 
It sounds scary that something bad is only "not particularly likely to go undetected"; don't we need to detect bad things reliably? But also, you used the word "not" three times and also the prefix "un-", meaning "not", once. Four negations in 13 words! Perhaps I'm not entirely in a position to cast aspersions on overly-complex phraseology -- the pot calling the kettle black and all that -- but I bet that will be a lot clearer if you reduce the number of negations to either 0 or 1. The comment change in standard_planner() doesn't look helpful to me; I'd leave it out. + * tableOid is the table that index is to be built on. indexOid is the OID + * of a index to be created or reindexed (which must be a btree index). I'd rewrite that first sentence to end "the table on which the index is to be built". The second sentence should say "an index" rather than "a index". + * leaderWorker indicates whether leader will participate as worker or not. + * This needs to be taken into account because leader process is guaranteed to + * be idle when not participating as a worker, in contrast with conventional + * parallel relation scans, where the leader process typically has plenty of + * other work to do relating to coordinating the scan, etc. For CREATE INDEX, + * leader is usually treated as just another participant for our scaling + * calculation. OK, I get the first sentence. But the rest of this appears to be partially irrelevant and partially incorrect. The degree to which the leader is likely to be otherwise occupied isn't very relevant; as long as we think it's going to do anything at all, we have to account for it somehow. Also, the issue isn't that in a query the leader would be busy "coordinating the scan, etc." but rather that it would have to read the tuples produced by the Gather (Merge) node. I think you could just delete everything from "This needs to be..." through the end. You can cover the details of how it's used closer to the point where you do anything with leaderWorker (or, as I assume it will soon be, leaderParticipates). But, actually, I think we would be better off just ripping leaderWorker/leaderParticipates out of this function altogether. compute_parallel_worker() is not really under any illusion that it's computing a number of *participants*; it's just computing a number of *workers*. Deducting 1 when the leader is also participating but only when at least 2 workers were computed leads to an oddity: for a regular parallel sequential scan, the number of workers increases by 1 when the table size increases by a factor of 3, but here, the number of workers increases from 1 to 2 when the table size increases by a factor of 9, and then by 1 for every further multiple of 3. There doesn't seem to be any theoretical or practical justification for such behavior, or for being inconsistent with what parallel sequential scan does otherwise. I think it's fine for parallel_leader_participation=off to simply mean that you get one fewer participant. That's actually what would happen with parallel query, too. Parallel query would consider parallel_leader_participation later, in get_parallel_divisor(), when working out the cost of one path vs. another, but it doesn't use it to choose the number of workers. So it seems to me that getting rid of all of the leaderWorker considerations will make it both simpler and more consistent with what we do for queries.
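(To put numbers on the oddity, assuming the default min_parallel_table_scan_size of 8MB: compute_parallel_worker() computes 1 worker at 8MB, 2 at 24MB, 3 at 72MB, and 4 at 216MB. After deducting one for the participating leader, CREATE INDEX would request 1 worker at 8MB, still just 1 at 24MB, and 2 only at 72MB -- hence the factor-of-9 jump, versus the uniform factor-of-3 steps everywhere else.)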
To be clear, I don't think there's any real need for the cost model we choose for CREATE INDEX to be the same as the one we use for regular scans. The problem with regular scans is that it's very hard to predict how many workers we can usefully use; it depends not only on the table size but on what plan nodes get stacked on top of it higher in the plan tree. In a perfect world we'd like to add as many workers as required to avoid having the query be I/O bound and then stop, but that would require both the ability to predict future system utilization and a heck of a lot more knowledge than the planner can hope to have at this point. If you have an idea how to make a better cost model than this for CREATE INDEX, I'm willing to consider other options. If you don't, or want to propose that as a follow-up patch, then I think it's OK to use what you've got here for starters. I just don't want it to be more baroque than necessary. I think that the naming of the wait events could be improved. Right now, they are named by which kind of process does the waiting, but they really should be named based on the thing for which we're waiting. I also suggest that we could just write Sort instead of Tuplesort. In short, I suggest ParallelTuplesortLeader -> ParallelSortWorkersDone and ParallelTuplesortWorker -> ParallelSortTapeHandover. Not for this patch, but I wonder if it might be a worthwhile future optimization to allow workers to return multiple tapes to the leader. One doesn't want to go crazy with this, of course. If the worker returns 100 tapes, then the leader might get stuck doing multiple merge passes, which would be a foolish way to divide up the labor, and even if that doesn't happen, Amdahl's law argues for minimizing the amount of work that is not done in parallel. Still, what if a worker (perhaps after merging) ends up with 2 or 3 tapes? Is it really worth merging them so that the leader can do a 5-way merge instead of a 15-way merge? Maybe this case is rare in practice, because multiple merge passes will be uncommon with reasonable values of work_mem, and it might be silly to go to the trouble of firing up workers if they'll only generate a few runs in total. Just a thought. + * Make sure that the temp file(s) underlying the tape set are created in + * suitable temp tablespaces. This is only really needed for serial + * sorts. This comment makes me wonder whether it is "sorta" needed for parallel sorts. - if (trace_sort) + if (trace_sort && !WORKER(state)) I have a feeling we still want to get this output even from workers, but maybe I'm missing something. + arg5 indicates serial, parallel worker, or parallel leader sort.</entry> I think it should say what values are used for each case. + /* Release worker tuplesorts within leader process as soon as possible */ IIUC, the worker tuplesorts aren't really holding onto much of anything in terms of resources. I think it might be better to phrase this as /* The sort we just did absorbed the final tapes produced by these tuplesorts, which are of no further use. */ or words to that effect. Instead of making a special case in CreateParallelContext for serializable_okay, maybe index_build should just use SetConfigOption() to force the isolation level to READ COMMITTED right after it does NewGUCNestLevel(). The change would only be temporary because the subsequent call to AtEOXact_GUC() will revert it.
The point isn't really that CREATE INDEX is somehow exempt from the problem that SIREAD locks haven't been updated to work correctly with parallelism; it's that CREATE INDEX itself is defined to ignore serializability concerns. There is *still* more to review here, but my concentration is fading. If you could post an updated patch after adjusting for the comments above, I think that would be helpful. I'm not totally out of things to review that I haven't already looked over once, but I think I'm close. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
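P.S. To be concrete about the IndexBuildHeapRangeScan() restructuring I'm suggesting above, I mean something shaped like this (just a sketch; need_unregister_snapshot is the only new name, and the caller-supplied-scan branch is schematic):

	Snapshot	snapshot;
	bool		need_unregister_snapshot = false;

	if (scan != NULL)
	{
		/*
		 * Caller supplied a scan: just use it, and whatever snapshot it
		 * was set up with.  (How OldestXmin is obtained in this case is
		 * a separate question.)
		 */
		snapshot = scan->rs_snapshot;
	}
	else if (IsBootstrapProcessingMode() || indexInfo->ii_Concurrent)
	{
		snapshot = RegisterSnapshot(GetTransactionSnapshot());
		need_unregister_snapshot = true;
		OldestXmin = InvalidTransactionId;	/* not used */
	}
	else
	{
		snapshot = SnapshotAny;
		/* okay to ignore lazy VACUUMs here */
		OldestXmin = GetOldestXmin(heapRelation, PROCARRAY_FLAGS_VACUUM);
	}

	/* ... scan the heap and build the index ... */

	if (need_unregister_snapshot)
		UnregisterSnapshot(snapshot);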
On Fri, Jan 12, 2018 at 6:14 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Jan 12, 2018 at 8:19 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> 1. >> + if (!IsBootstrapProcessingMode() && !indexInfo->ii_Concurrent) >> { >> - snapshot = RegisterSnapshot(GetTransactionSnapshot()); >> - OldestXmin = InvalidTransactionId; /* not used */ >> + OldestXmin = GetOldestXmin(heapRelation, true); >> >> I think leader and workers should have the same idea of oldestXmin for >> the purpose of deciding the visibility of tuples. I think this is >> ensured in all form of parallel query as we do share the snapshot, >> however, same doesn't seem to be true for Parallel Index builds. > > Hmm. Does it break anything if they use different snapshots? In the > case of a query that would be disastrous because then you might get > inconsistent results, but if the snapshot is only being used to > determine what is and is not dead then I'm not sure it makes much > difference ... unless the different snapshots will create confusion of > some other sort. I think that this is fine. GetOldestXmin() is only used when we have a ShareLock on the heap relation, and the snapshot is SnapshotAny. We're only talking about the difference between HEAPTUPLE_DEAD and HEAPTUPLE_RECENTLY_DEAD here. Indexing a heap tuple when that wasn't strictly necessary by the time you got to it is normal. However, it's not okay that GetOldestXmin()'s second argument is true in the patch, rather than PROCARRAY_FLAGS_VACUUM. That's due to bitrot that was not caught during some previous rebase (commit af4b1a08 changed the signature). Will fix. You've given me a lot more to work through in your most recent mail, Robert. I will probably get the next revision to you on Monday. Doesn't seem like there is much point in posting what I've done so far. -- Peter Geoghegan
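PS -- concretely, that just means restoring the call to the post-af4b1a08 form:

OldestXmin = GetOldestXmin(heapRelation, PROCARRAY_FLAGS_VACUUM);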
On Sat, Jan 13, 2018 at 1:25 AM, Peter Geoghegan <pg@bowt.ie> wrote: > On Fri, Jan 12, 2018 at 6:14 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Fri, Jan 12, 2018 at 8:19 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: >>> 1. >>> + if (!IsBootstrapProcessingMode() && !indexInfo->ii_Concurrent) >>> { >>> - snapshot = RegisterSnapshot(GetTransactionSnapshot()); >>> - OldestXmin = InvalidTransactionId; /* not used */ >>> + OldestXmin = GetOldestXmin(heapRelation, true); >>> >>> I think leader and workers should have the same idea of oldestXmin for >>> the purpose of deciding the visibility of tuples. I think this is >>> ensured in all forms of parallel query, as we do share the snapshot; >>> however, the same doesn't seem to be true for parallel index builds. >> >> Hmm. Does it break anything if they use different snapshots? In the >> case of a query, that would be disastrous, because then you might get >> inconsistent results; but if the snapshot is only being used to >> determine what is and is not dead, then I'm not sure it makes much >> difference ... unless the different snapshots will create confusion of >> some other sort. > > I think that this is fine. GetOldestXmin() is only used when we have a > ShareLock on the heap relation, and the snapshot is SnapshotAny. We're > only talking about the difference between HEAPTUPLE_DEAD and > HEAPTUPLE_RECENTLY_DEAD here. Indexing a heap tuple when that wasn't > strictly necessary by the time you got to it is normal. > Yeah, but this would mean that now with parallel create index, it is possible that some tuples from the transaction would end up in the index and others won't. In general, this makes me slightly nervous mainly because such a case won't be possible without the parallel option for create index, but if you and Robert are okay with it as there is no fundamental problem, then we might as well leave it as it is or maybe add a comment saying so. Another point is that the information about broken hot chains indexInfo->ii_BrokenHotChain is getting lost. I think you need to coordinate this information among backends that participate in parallel create index. A test to reproduce the problem is as below: create table tbrokenchain(c1 int, c2 varchar); insert into tbrokenchain values(3, 'aaa'); begin; set force_parallel_mode=on; update tbrokenchain set c2 = 'bbb' where c1=3; create index idx_tbrokenchain on tbrokenchain(c1); commit; Now, check the value of indcheckxmin in pg_index: it should be true, but with the patch it is false. You can also try this with the patch without changing the value of force_parallel_mode. The patch uses both parallel_leader_participation and force_parallel_mode, but it seems the definition is different from what we have in Gather. Basically, even with force_parallel_mode, the leader is participating in the parallel build. I see there is some discussion above about both these parameters and still, there is not complete agreement on the best way forward. I think we should have parallel_leader_participation as that can help in testing, if nothing else. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Sat, Jan 13, 2018 at 6:02 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > > The patch uses both parallel_leader_participation and > force_parallel_mode, but it seems the definition is different from > what we have in Gather. Basically, even with force_parallel_mode, the > leader is participating in the parallel build. I see there is some > discussion above about both these parameters and still, there is not > complete agreement on the best way forward. I think we should have > parallel_leader_participation as that can help in testing, if nothing > else. > Or maybe just have force_parallel_mode. I think at least one of these is required to make some form of testing of the parallel code easy. As you can see from my previous email, it was quite easy to demonstrate a test with force_parallel_mode. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Sat, Jan 13, 2018 at 4:32 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > Yeah, but this would mean that now with parallel create index, it is > possible that some tuples from the transaction would end up in the index > and others won't. You mean some tuples from some past transaction that deleted a bunch of tuples and committed, but not before someone acquired a still-held snapshot that didn't see the deleter's transaction as committed yet? I guess that that is different, but it doesn't matter. All that matters is that in the end, the index contains all entries for all heap tuples visible to any possible snapshot (though possibly excluding some existing old snapshots iff we detect broken HOT chains during builds). > In general, this makes me slightly nervous mainly > because such a case won't be possible without the parallel option for > create index, but if you and Robert are okay with it as there is no > fundamental problem, then we might as well leave it as it is or maybe > add a comment saying so. Let me try to explain this another way, in terms of the high-level intuition that I have about it (Robert can probably skip this part). GetOldestXmin() returns a value that is inherently a *conservative* cut-off. In hot standby mode, it's possible for the value it returns to go backwards from a value previously returned within the same backend. Even with serial builds, the exact instant that GetOldestXmin() gets called could vary based on something like the OS scheduling of the process that runs CREATE INDEX. It could have a different value based only on that. It follows that it won't matter if parallel CREATE INDEX participants have a slightly different value, because the cut-off is all about the consistency of the index with what the universe of possible snapshots could see in the heap, not the consistency of different parts of the index with each other (the parts produced from heap tuples read from each participant). Look at how the pg_visibility module calls GetOldestXmin() to recheck -- it has to call GetOldestXmin() a second time, with a buffer lock held on a heap page throughout. It does this to conclusively establish that the visibility map is corrupt (otherwise, it could just be that the cut-off became stale). Putting all of this together, it would be safe for the HEAPTUPLE_RECENTLY_DEAD case within IndexBuildHeapRangeScan() to call GetOldestXmin() again (a bit like pg_visibility does), to avoid having to index an actually-fully-dead-by-now tuple (we could call HeapTupleSatisfiesVacuum() a second time for the heap tuple, hoping to get HEAPTUPLE_DEAD the second time around). This optimization wouldn't work out a lot of the time (it would only work out when an old snapshot went away during the CREATE INDEX), and would add ProcArrayLock traffic, so we don't do it. But AFAICT it's feasible. > Another point is that the information about broken hot chains > indexInfo->ii_BrokenHotChain is getting lost. I think you need to > coordinate this information among backends that participate in > parallel create index. A test to reproduce the problem is as below: > > create table tbrokenchain(c1 int, c2 varchar); > insert into tbrokenchain values(3, 'aaa'); > > begin; > set force_parallel_mode=on; > update tbrokenchain set c2 = 'bbb' where c1=3; > create index idx_tbrokenchain on tbrokenchain(c1); > commit; > > Now, check the value of indcheckxmin in pg_index: it should be true, > but with the patch it is false.
> You can also try this with the patch without changing the value of > force_parallel_mode. Ugh, you're right. That's a real howler. Will fix. Note that my stress-testing strategy has had a lot to do with verifying that a serial build has relfiles that are physically identical to parallel builds. Obviously that couldn't have caught this, because this only concerns the state of the pg_index catalog. > The patch uses both parallel_leader_participation and > force_parallel_mode, but it seems the definition is different from > what we have in Gather. Basically, even with force_parallel_mode, the > leader is participating in the parallel build. I see there is some > discussion above about both these parameters and still, there is not > complete agreement on the best way forward. I think we should have > parallel_leader_participation as that can help in testing, if nothing > else. I think that you're quite right that parallel_leader_participation needs to be supported for testing purposes. I had some sympathy for the idea that we should remove leader participation as a worker from the patch entirely, but the testing argument seems to clinch it. I'm fine with killing force_parallel_mode, though, because it will be possible to force the use of parallelism by using the existing parallel_workers table storage param in the next version of the patch, regardless of how small the table is. Thanks for the review. -- Peter Geoghegan
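PS -- my tentative plan for the ii_BrokenHotChain fix is something like the following (a sketch; it assumes a new flag alongside the mutex in the shared state struct the patch already uses for the build, here called btshared):

/* In each worker, once its portion of the heap scan is done: */
SpinLockAcquire(&btshared->mutex);
btshared->brokenhotchain |= indexInfo->ii_BrokenHotChain;
SpinLockRelease(&btshared->mutex);

/* In the leader, after waiting for workers to finish, so that
 * index_build() still sets indcheckxmin correctly: */
if (btshared->brokenhotchain)
	indexInfo->ii_BrokenHotChain = true;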
> The patch uses both parallel_leader_participation and
> force_parallel_mode, but it seems the definition is different from
> what we have in Gather. Basically, even with force_parallel_mode, the
> leader is participating in parallel build. I see there is some
> discussion above about both these parameters and still, there is not
> complete agreement on the best way forward. I think we should have
> parallel_leader_participation as that can help in testing if nothing
> else.

I think that you're quite right that parallel_leader_participation needs to be supported for testing purposes. I had some sympathy for the idea that we should remove leader participation as a worker from the patch entirely, but the testing argument seems to clinch it. I'm fine with killing force_parallel_mode, though, because it will be possible to force the use of parallelism by using the existing parallel_workers table storage param in the next version of the patch, regardless of how small the table is.

Thanks for the review.
--
Peter Geoghegan

On Sun, Jan 14, 2018 at 1:43 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Sat, Jan 13, 2018 at 4:32 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Yeah, but this would mean that now with parallel create index, it is
>> possible that some tuples from the transaction would end up in index
>> and others won't.
>
> You mean some tuples from some past transaction that deleted a bunch
> of tuples and committed, but not before someone acquired a still-held
> snapshot that didn't see the deleter's transaction as committed yet?
>

I think I am talking about something different. Let me try to explain in some more detail. Consider a transaction T-1 that has deleted two tuples from tab-1, the first on page-1 and the second on page-2, and committed. There is a concurrent transaction T-2 with an open snapshot/query, due to which oldestXmin will be smaller than T-1. Now, in another session, we start a parallel CREATE INDEX on tab-1, which launches one worker. The worker decides to scan page-1 and will find that the deleted tuple on page-1 is Recently Dead, so it will include it in the index. In the meantime, transaction T-2 commits/aborts, which allows oldestXmin to advance past transaction T-1, and now the leader decides to scan page-2 with a freshly computed oldestXmin and finds that the tuple on that page is Dead, so it decides not to include it in the index. So, this leads to a situation where some tuples deleted by the transaction end up in the index whereas others don't. Note that I am not arguing that there is any fundamental problem with this, but just want to highlight that such a case doesn't seem to exist with serial CREATE INDEX.

>
>> The patch uses both parallel_leader_participation and
>> force_parallel_mode, but it seems the definition is different from
>> what we have in Gather. Basically, even with force_parallel_mode, the
>> leader is participating in parallel build. I see there is some
>> discussion above about both these parameters and still, there is not
>> complete agreement on the best way forward. I think we should have
>> parallel_leader_participation as that can help in testing if nothing
>> else.
>
> I think that you're quite right that parallel_leader_participation
> needs to be supported for testing purposes. I had some sympathy for
> the idea that we should remove leader participation as a worker from
> the patch entirely, but the testing argument seems to clinch it. I'm
> fine with killing force_parallel_mode, though, because it will be
> possible to force the use of parallelism by using the existing
> parallel_workers table storage param in the next version of the patch,
> regardless of how small the table is.
>

Okay, this makes sense to me.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Sun, Jan 14, 2018 at 8:25 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Sun, Jan 14, 2018 at 1:43 AM, Peter Geoghegan <pg@bowt.ie> wrote:
>> On Sat, Jan 13, 2018 at 4:32 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> Yeah, but this would mean that now with parallel create index, it is
>>> possible that some tuples from the transaction would end up in index
>>> and others won't.
>>
>> You mean some tuples from some past transaction that deleted a bunch
>> of tuples and committed, but not before someone acquired a still-held
>> snapshot that didn't see the deleter's transaction as committed yet?
>>
>
> I think I am talking about something different. Let me try to explain
> in some more detail. Consider a transaction T-1 that has deleted two
> tuples from tab-1, the first on page-1 and the second on page-2, and
> committed. There is a concurrent transaction T-2 with an open
> snapshot/query, due to which oldestXmin will be smaller than T-1.
> Now, in another session, we start a parallel CREATE INDEX on tab-1,
> which launches one worker. The worker decides to scan page-1 and will
> find that the deleted tuple on page-1 is Recently Dead, so it will
> include it in the index. In the meantime, transaction T-2
> commits/aborts, which allows oldestXmin to advance past transaction
> T-1, and now the leader decides to scan page-2 with a freshly computed
> oldestXmin and finds that the tuple on that page is Dead, so it
> decides not to include it in the index. So, this leads to a situation
> where some tuples deleted by the transaction end up in the index
> whereas others don't. Note that I am not arguing that there is any
> fundamental problem with this, but just want to highlight that such a
> case doesn't seem to exist with serial CREATE INDEX.

I must have not done a good job of explaining myself ("You mean some tuples from some past transaction..."), because this is exactly what I meant, and was exactly how I understood your original remarks from Saturday.

In summary, while I do agree that this is different to what we see with serial index builds, I still don't think that this is a concern for us.

--
Peter Geoghegan
On Fri, Jan 12, 2018 at 10:28 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> More comments:

Attached patch has all open issues worked through, including those that I respond to or comment on below, as well as the other feedback from your previous e-mails. Note also that I fixed the issue that Amit raised, as well as the GetOldestXmin()-argument bug that I noticed in passing when responding to Amit. I also worked on the attribution in the commit message.

Before getting to my responses to your most recent round of feedback, I want to first talk about some refactoring that I decided to do. As you can see from the master branch, tuplesort_performsort() isn't necessarily reached for spool2, even when we start out with a spool2 (that is, for many unique index builds, spool2 never even does a tuplesort_performsort()). We may instead decide to shut down spool2 when it has no (dead) tuples. I made this work just as well for the parallel case in this latest revision. I had to teach tuplesort.c to accept an early tuplesort_end() for LEADER() -- it had to be prepared to release still-waiting workers in some cases, rather than depending on nbtsort.c having called tuplesort_performsort() already. Several routines within nbtsort.c that previously knew something about parallelism now know nothing about it. This seems like a nice win.

Separately, I took advantage of the fact that within the leader, its *worker* Tuplesortstate can safely call tuplesort_end() before the leader state's tuplesort_performsort() call.

The overall effect of these two changes is that there is now a _bt_leader_heapscan() call for the parallel case that nicely mirrors the serial case's IndexBuildHeapScan() call, and once we're done with populating spools, no subsequent code needs to know a single thing about parallelism as a special case. You may notice some small changes to the tuplesort.h overview, which now advertises that callers can take advantage of this leeway.
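For reference, the master-branch pattern in btbuild() that now carries over unchanged to the parallel case looks roughly like this (paraphrased, so treat it as a sketch):

    /* btbuild(), after the heap scan has populated the spools */
    if (buildstate.spool2 && !buildstate.havedead)
    {
        /* spool2 turns out to be unnecessary -- just free it */
        _bt_spooldestroy(buildstate.spool2);  /* tuplesort_end() with no performsort */
        buildstate.spool2 = NULL;
    }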
Now on to my responses to your most recent round of feedback...

> BufFileView() looks fairly pointless. It basically creates a copy of
> the input and, in so doing, destroys the input, which is a lot like
> returning the input parameter except that it uses more cycles. It
> does do a few things.

While it certainly did occur to me that that was kind of weird, and I struggled with it on my own for a little while, I ultimately agreed with Thomas that it added something to have ltsConcatWorkerTapes() call some buffile function in every iteration of its loop. (BufFileView() + BufFileViewAppend() are code that Thomas actually wrote, though I added the asserts and comments myself.)

If you think about this in terms of the interface rather than the implementation, then it may make more sense. The encapsulation adds something which might pay off later, such as when extendBufFile() needs to work with a concatenated set of BufFiles. And even right now, I cannot simply reuse the BufFile without then losing the assert that is currently in BufFileViewAppend() (the must-not-have-an-associated-shared-fileset assert). So I'd end up asserting less (rather than more) there if BufFileView() was removed.

It wastes some cycles to not simply use the BufFile directly, but not terribly many in the grand scheme of things. This happens once per external sort operation.

> In miscadmin.h, I'd put the prototype for the new GUC next to
> max_worker_processes, not maintenance_work_mem.

But then I'd really have to put it next to max_worker_processes in globals.c, too. That would mean that it would go under "Primary determinants of sizes of shared-memory structures" within globals.c, which seems wrong to me. What do you think?

> The ereport() in index_build will, I think, confuse people when it
> says that there are 0 parallel workers. I suggest splitting this into
> two cases: if (indexInfo->ii_ParallelWorkers == 0) ereport(...
> "building index \"%s\" on table \"%s\" serially" ...) else ereport(...
> "building index \"%s\" on table \"%s\" in parallel with request for %d
> parallel workers" ...).

WFM. I've simply dropped any reference to leader participation in the messages here, to keep things simple. This seemed okay because the only thing that affects leader participation is the parallel_leader_participation GUC, which is under the user's direct control at all times, and is unlikely to be changed. Those that really want further detail have trace_sort for that.

> The logic in IndexBuildHeapRangeScan() around need_register_snapshot
> and OldestXmin seems convoluted and not very well-edited to me.

Having revisited it, I now agree that the code added to IndexBuildHeapRangeScan() was unclear, primarily in that the need_unregister_snapshot local variable was overloaded in a weird way.

> I suggest that you go back to the original code organization
> and then just insert an additional case for a caller-supplied scan, so
> that the overall flow looks like this:
>
> if (scan != NULL)
> {
>     ...
> }
> else if (IsBootstrapProcessingMode() || indexInfo->ii_Concurrent)
> {
>     ...
> }
> else
> {
>     ...
> }

The problem that I see with this alternative flow is that the "if (scan != NULL)" and the "else if (IsBootstrapProcessingMode() || indexInfo->ii_Concurrent)" blocks clearly must contain code for two distinct, non-overlapping cases, despite the fact that those two cases actually do overlap somewhat. That is, a call to IndexBuildHeapRangeScan() may have a (parallel) heap scan argument (control reaches your first code block), or may not (control reaches your second or third code block). At the same time, a call to IndexBuildHeapRangeScan() may use SnapshotAny (ordinary CREATE INDEX), or may need an MVCC snapshot (either by registering its own, or using the parallel one). These two things are orthogonal.

I think I still get the gist of what you're saying, though. I've come up with a new structure that is a noticeable improvement on what I had. Importantly, the new structure let me add a number of parallelism-agnostic asserts that make sure that every ambuild routine that supports parallelism gets the details right.

> Along with that, I'd change the name of need_register_snapshot to
> need_unregister_snapshot (it's doing both jobs right now) and
> initialize it to false.

Done.

> + * This support code isn't reliable when called from within a parallel
> + * worker process due to the fact that our state isn't propagated. This is
> + * why parallel index builds are disallowed on catalogs. It is possible
> + * that we'll fail to catch an attempted use of a user index undergoing
> + * reindexing due the non-propagation of this state to workers, which is not
> + * ideal, but the problem is not particularly likely to go undetected due to
> + * our not doing better there.
>
> I understand the first two sentences, but I have no idea what the
> third one means, especially the part that says "not particularly
> likely to go undetected due to our not doing better there". It sounds
> scary that something bad is only "not particularly likely to go
> undetected"; don't we need to detect bad things reliably?
The primary point here, that you said you understood, is that we definitely need to detect when we're reindexing a catalog index within the backend, so that systable_beginscan() can do the right thing and not use the index (we also must avoid assertion failures). My solution to that problem is, of course, to not allow the use of parallel CREATE INDEX when REINDEXing a system catalog. That seems 100% fine to me.

There is a little bit of ambiguity about other cases, though -- that's the secondary point I tried to make within that comment block, and the part that you took issue with. To put this secondary point another way: it's possible that we'd fail to detect it if someone's comparator went bananas and decided it was okay to do SQL access (that resulted in an index scan of the index undergoing reindex). That does seem rather unlikely, but I felt it necessary to say something like this because ReindexIsProcessingIndex() isn't already something that only deals with catalog indexes -- it works with all indexes.

Anyway, I reworded this. I hope that what I came up with is clearer than before.

> But also,
> you used the word "not" three times and also the prefix "un-", meaning
> "not", once. Four negations in 13 words! Perhaps I'm not entirely in
> a position to cast aspersions on overly-complex phraseology -- the pot
> calling the kettle black and all that -- but I bet that will be a lot
> clearer if you reduce the number of negations to either 0 or 1.

You're not wrong. Simplified.

> The comment change in standard_planner() doesn't look helpful to me;
> I'd leave it out.

Okay.

> + * tableOid is the table that index is to be built on. indexOid is the OID
> + * of a index to be created or reindexed (which must be a btree index).
>
> I'd rewrite that first sentence to end "the table on which the index
> is to be built". The second sentence should say "an index" rather
> than "a index".

Okay.

> But, actually, I think we would be better off just ripping
> leaderWorker/leaderParticipates out of this function altogether.
> compute_parallel_worker() is not really under any illusion that it's
> computing a number of *participants*; it's just computing a number of
> *workers*.

That distinction does seem to cause plenty of confusion. While I accept what you say about compute_parallel_worker(), I still haven't gone as far as removing the leaderParticipates argument altogether, because compute_parallel_worker() isn't the only thing that matters here. (More on that below.)

> I think it's fine for
> parallel_leader_participation=off to simply mean that you get one
> fewer participants. That's actually what would happen with parallel
> query, too. Parallel query would consider
> parallel_leader_participation later, in get_parallel_divisor(), when
> working out the cost of one path vs. another, but it doesn't use it to
> choose the number of workers. So it seems to me that getting rid of
> all of the workerLeader considerations will make it both simpler and
> more consistent with what we do for queries.

I was aware of those details, and figured that parallel query fudges the compute_parallel_worker() figure's leader participation in some sense, and that that was what I needed to compensate for. After all, when parallel_leader_participation=off, having compute_parallel_worker() return 1 means rather a different thing to what it means with parallel_leader_participation=on, even though in general we seem to assume that parallel_leader_participation can only make a small difference overall.
Here's what I've done based on your feedback: I've changed the header comments, but stopped leaderParticipates from affecting the compute_parallel_worker() calculation (so, as I said, leaderParticipates stays). The leaderParticipates argument continues to affect these two aspects of plan_create_index_workers()'s return value:

1. It continues to be used so we have a total number of participants (not workers) to apply our must-have-32MB-workMem limit on participants. Parallel query has no equivalent of this, and it seems warranted. Note that this limit is no longer applied when the parallel_workers storage param was set, as discussed.

2. I continue to use the leaderParticipates argument to disallow the case where there is only one CREATE INDEX participant but parallelism is in use, because, of course, that clearly makes no sense -- we should just use a serial sort instead. (It might make sense to allow this if parallel_leader_participation was *purely* a testing GUC, only for use by backend hackers, but AFAICT it isn't.)

The planner can allow a single-participant parallel sequential scan path to be created without worrying about the fact that that doesn't make much sense, because a plan with only one parallel participant is always going to cost more than some serial plan (you will only get a 1-participant parallel sequential scan when force_parallel_mode is on). Obviously plan_create_index_workers() doesn't generate (partial) paths at all, so I simply have to get the same outcome (avoiding a senseless 1-participant parallel operation) some other way here.

> If you have an idea how to make a better
> cost model than this for CREATE INDEX, I'm willing to consider other
> options. If you don't, or want to propose that as a follow-up patch,
> then I think it's OK to use what you've got here for starters. I just
> don't want it to be more baroque than necessary.

I suspect that the parameters of any cost model for parallel CREATE INDEX that we're prepared to consider for v11 are: "Use a number of parallel workers that is one below the number at which the total duration of the CREATE INDEX either stays the same or goes up".

It's hard to do much better than this within those parameters. I can see a fairly noticeable benefit to parallelism with 4 parallel workers and a measly 1MB of maintenance_work_mem (when parallelism is forced) relative to the serial case with the same amount of memory. At least on my laptop, it seems to be rather hard to lose relative to a serial sort when using parallel CREATE INDEX (to be fair, I'm probably actually using way more memory than 1MB to do this due to FS cache usage). I can think of a cleverer approach to costing parallel CREATE INDEX, but it's only cleverer by weighing distributed costs. Not very relevant, for the time being.

BTW, the 32MB per participant limit within plan_create_index_workers() was chosen based on the realization that any higher value would make having a default setting of 2 for max_parallel_maintenance_workers (to match the max_parallel_workers_per_gather default) pointless when the default maintenance_work_mem value of 64MB is in use. That's not terribly scientific, though it at least doesn't come at the expense of a more scientific idea for a limit like that (I don't actually have one, you see). I am merely trying to avoid being *gratuitously* wasteful of shared resources that are difficult to accurately cost in (e.g., the distributed cost of random I/O to the system as a whole when we do a parallel index build while ridiculously low on maintenance_work_mem).
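Spelled out as a sketch (hypothetical variable names -- not the exact patch code), the limit amounts to something like this:

    /*
     * Sketch of the plan_create_index_workers() participant cap: each
     * participant should get at least 32MB of maintenance_work_mem,
     * which is measured in KB (32MB == 32768KB).
     */
    int     participants = workers + (leaderParticipates ? 1 : 0);

    if (!storage_param_was_set)
    {
        while (participants > 1 &&
               maintenance_work_mem / participants < 32768)
            participants--;
    }

    /* A sort with a single participant isn't parallel -- go serial */
    if (participants <= 1)
        return 0;

    return participants - (leaderParticipates ? 1 : 0);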
> I think that the naming of the wait events could be improved. Right
> now, they are named by which kind of process does the waiting, but
> they really should be named based on the thing for which we're
> waiting. I also suggest that we could just write Sort instead of
> Tuplesort. In short, I suggest ParallelTuplesortLeader ->
> ParallelSortWorkersDone and ParallelTuplesortLeader ->
> ParallelSortTapeHandover.

WFM. Also added documentation for the wait events to monitoring.sgml, which I somehow missed the first time around.

> Not for this patch, but I wonder if it might be a worthwhile future
> optimization to allow workers to return multiple tapes to the leader.
> One doesn't want to go crazy with this, of course. If the worker
> returns 100 tapes, then the leader might get stuck doing multiple
> merge passes, which would be a foolish way to divide up the labor, and
> even if that doesn't happen, Amdahl's law argues for minimizing the
> amount of work that is not done in parallel. Still, what if a worker
> (perhaps after merging) ends up with 2 or 3 tapes? Is it really worth
> merging them so that the leader can do a 5-way merge instead of a
> 15-way merge?

I did think about this myself, or rather I thought specifically about building a serial/bigserial PK during pg_restore, a case that must be very common. The worker merges for such an index build will typically be *completely pointless* when all input runs are in sorted order, because the merge heap will only need to consult the root of the heap and its two immediate children throughout (commit 24598337c helped cases like this enormously). You might as well merge hundreds of runs in the leader, provided you still have enough memory per tape that you can get the full benefit of OS readahead (this is not that hard when you're only going to repeatedly read from the same tape anyway).

I'm not too worried about it, though. The overall picture is still very positive even in this case. The "extra worker merging" isn't generally a big proportion of the overall cost, especially there. More importantly, if I tried to do better, it would be the "quicksort with spillover" cost model story all over again (remember how tedious that was?). How hard are we prepared to work to ensure that we get it right when it comes to skipping worker merging, given that users always pay some overhead, even when that doesn't happen?

Note also that parallel index builds manage to unfairly *gain* advantage over serial cases (they have the good variety of dumb luck, rather than the bad variety) in certain other common cases. This happens with an *inverse* physical/logical correlation (e.g. a DESC index build on a date field). They manage to artificially do better than theory would predict, simply because a greater number of smaller quicksorts are much faster during initial run generation, without also taking a concomitant performance hit at merge time. Thomas showed this at one point. Note that even that's only true because of the qsort precheck (what I like to call the "banana skin prone" precheck, which we added to our qsort implementation in 2006) -- it would be true for *all* correlations, but that one precheck thing complicates matters.

All of this is a tricky business, and that isn't going to get any easier IMV.
> + * Make sure that the temp file(s) underlying the tape set are created in
> + * suitable temp tablespaces. This is only really needed for serial
> + * sorts.
>
> This comment makes me wonder whether it is "sorta" needed for parallel sorts.

I removed "really". The point of the comment is that we've already set up temp tablespaces for the shared fileset in the parallel case. Shared filesets figure out which tablespaces will be used up-front -- see SharedFileSetInit().

> - if (trace_sort)
> + if (trace_sort && !WORKER(state))
>
> I have a feeling we still want to get this output even from workers,
> but maybe I'm missing something.

I updated tuplesort_end() so that trace_sort reports on the end of the sort, even for worker processes. (We still don't show the generic tuplesort_begin* message for workers, though.)

> + arg5 indicates serial, parallel worker, or parallel leader sort.</entry>
>
> I think it should say what values are used for each case.

I based this on "arg0 indicates heap, index or datum sort", where it's implied that the values are respective to the order that they appear in the sentence (starting from 0). But okay, I'll do it that way all the same.

> + /* Release worker tuplesorts within leader process as soon as possible */
>
> IIUC, the worker tuplesorts aren't really holding onto much of
> anything in terms of resources. I think it might be better to phrase
> this as /* The sort we just did absorbed the final tapes produced by
> these tuplesorts, which are of no further use. */ or words to that
> effect.

Okay. Done that way.

> Instead of making a special case in CreateParallelContext for
> serializable_okay, maybe index_build should just use SetConfigOption()
> to force the isolation level to READ COMMITTED right after it does
> NewGUCNestLevel(). The change would only be temporary because the
> subsequent call to AtEOXact_GUC() will revert it.

I tried doing it that way, but it doesn't seem workable:

postgres=# begin transaction isolation level serializable ;
BEGIN
postgres=*# reindex index test_unique;
ERROR:  25001: SET TRANSACTION ISOLATION LEVEL must be called before any query
LOCATION:  call_string_check_hook, guc.c:9953

Note that AutoVacLauncherMain() uses SetConfigOption() to set/modify default_transaction_isolation -- not transaction_isolation.

Instead, I added a bit more to the comments within CreateParallelContext(), to justify what I've done along the lines you went into. Hopefully this works better for you.

> There is *still* more to review here, but my concentration is fading.
> If you could post an updated patch after adjusting for the comments
> above, I think that would be helpful. I'm not totally out of things
> to review that I haven't already looked over once, but I think I'm
> close.

I'm impressed with how quickly you're getting through review of the patch. Hopefully we can keep that momentum up.

Thanks
--
Peter Geoghegan
Attachment
Hi all,
I have been continuing to test the parallel CREATE INDEX patch. So far
I haven't come across any issues or regressions with the patches.
Here are a few performance numbers for the latest round of testing,
which was performed on top of the 6th Jan patch submitted by Peter.
Testing was done on an OpenStack instance with:
CPU: 8
RAM: 16GB
HD: 640 GB
postgres=# select pg_size_pretty(pg_total_relation_size
('lineitem'));
pg_size_pretty
----------------
93 GB
(1 row)
-- Test 1.
max_parallel_workers_maintenance = 2
max_parallel_workers = 16
max_parallel_workers_per_gather = 8
maintenance_work_mem = 1GB
max_wal_size = 4GB
-- Test 2.
max_parallel_workers_maintenance = 4
max_parallel_workers = 16
max_parallel_workers_per_gather = 8
maintenance_work_mem = 2GB
max_wal_size = 4GB
-- Test 3.
max_parallel_workers_maintenance = 8
max_parallel_workers = 16
max_parallel_workers_per_gather = 8
maintenance_work_mem = 4GB
max_wal_size = 4GB
NOTE: All time-taken entries are the median of 3 consecutive runs of the same B-tree index creation query.
Time taken for parallel index creation (Tests 1/2/3 use max_parallel_workers_maintenance = 2/4/8 respectively, as configured above):

Index on "bigint" column: CREATE INDEX li_ordkey_idx1 ON lineitem(l_orderkey);
  Test 1: 1062446.462 ms (17:42.446) without patch, 1024972.273 ms (17:04.972) with patch -- 3.52 % improvement
  Test 2: 1053468.945 ms (17:33.469) without patch, 896375.543 ms (14:56.376) with patch -- 17.75 % improvement
  Test 3: 1082920.703 ms (18:02.921) without patch, 932550.058 ms (15:32.550) with patch -- 13.88 % improvement

Index on "integer" column: CREATE INDEX li_lineno_idx2 ON lineitem(l_linenumber);
  Test 1: 1538285.499 ms (25:38.285) without patch, 1201008.423 ms (20:01.008) with patch -- 21.92 % improvement
  Test 2: 1529837.023 ms (25:29.837) without patch, 1014188.489 ms (16:54.188) with patch -- 33.70 % improvement
  Test 3: 1642160.947 ms (27:22.161) without patch, 978518.253 ms (16:18.518) with patch -- 40.41 % improvement

Index on "numeric" column: CREATE INDEX li_qty_idx3 ON lineitem(l_quantity);
  Test 1: 3968102.568 ms (01:06:08.103) without patch, 2359304.405 ms (39:19.304) with patch -- 40.54 % improvement
  Test 2: 4129510.930 ms (01:08:49.511) without patch, 1680201.644 ms (28:00.202) with patch -- 59.31 % improvement
  Test 3: 4348248.210 ms (01:12:28.248) without patch, 1490461.879 ms (24:50.462) with patch -- 65.72 % improvement

Index on "character" column: CREATE INDEX li_lnst_idx4 ON lineitem(l_linestatus);
  Test 1: 1510273.931 ms (25:10.274) without patch, 1240265.301 ms (20:40.265) with patch -- 17.87 % improvement
  Test 2: 1516842.985 ms (25:16.843) without patch, 995730.092 ms (16:35.730) with patch -- 34.35 % improvement
  Test 3: 1580789.375 ms (26:20.789) without patch, 984975.746 ms (16:24.976) with patch -- 37.69 % improvement

Index on "date" column: CREATE INDEX li_shipdt_idx5 ON lineitem(l_shipdate);
  Test 1: 1483603.274 ms (24:43.603) without patch, 1189704.930 ms (19:49.705) with patch -- 19.80 % improvement
  Test 2: 1498348.925 ms (24:58.349) without patch, 1040421.626 ms (17:20.422) with patch -- 30.56 % improvement
  Test 3: 1653651.499 ms (27:33.651) without patch, 1016305.794 ms (16:56.306) with patch -- 38.54 % improvement

Index on "character varying" column: CREATE INDEX li_comment_idx6 ON lineitem(l_comment);
  Test 1: 6945953.838 ms (01:55:45.954) without patch, 4329696.334 ms (01:12:09.696) with patch -- 37.66 % improvement
  Test 2: 6818556.437 ms (01:53:38.556) without patch, 2834034.054 ms (47:14.034) with patch -- 58.43 % improvement
  Test 3: 6942285.711 ms (01:55:42.286) without patch, 2648430.902 ms (44:08.431) with patch -- 61.85 % improvement

Composite index on "numeric", "character" columns: CREATE INDEX li_qtylnst_idx34 ON lineitem (l_quantity, l_linestatus);
  Test 1: 4961563.400 ms (01:22:41.563) without patch, 2959722.178 ms (49:19.722) with patch -- 40.34 % improvement
  Test 2: 5242809.501 ms (01:27:22.810) without patch, 2077463.136 ms (34:37.463) with patch -- 60.37 % improvement
  Test 3: 5576765.727 ms (01:32:56.766) without patch, 1755829.420 ms (29:15.829) with patch -- 68.51 % improvement

Composite index on "date", "character varying" columns: CREATE INDEX li_shipdtcomment_idx56 ON lineitem (l_shipdate, l_comment);
  Test 1: 4693318.077 ms (01:18:13.318) without patch, 3181494.454 ms (53:01.494) with patch -- 32.21 % improvement
  Test 2: 4627624.682 ms (01:17:07.625) without patch, 2613289.211 ms (43:33.289) with patch -- 43.52 % improvement
  Test 3: 4719242.965 ms (01:18:39.243) without patch, 2685516.832 ms (44:45.517) with patch -- 43.09 % improvement
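For example, the Test 1 measurement for the first index was taken with a run of this form (session-level SETs shown for illustration; in practice the settings above were configured in postgresql.conf):

    SET maintenance_work_mem = '1GB';
    SET max_parallel_workers_maintenance = 2;
    CREATE INDEX li_ordkey_idx1 ON lineitem(l_orderkey);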
On Tue, Jan 16, 2018 at 6:24 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Fri, Jan 12, 2018 at 10:28 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> More comments:
>
> Attached patch has all open issues worked through, including those
> that I respond to or comment on below, as well as the other feedback
> from your previous e-mails. Note also that I fixed the issue that Amit
> raised,
>

I could still reproduce it. I think the way you have fixed it has a race condition. In _bt_parallel_scan_and_sort(), the value of brokenhotchain is set after you signal the leader that the worker is done (by incrementing workersFinished). Now, the leader is free to make a decision based on the current shared state, which can give the wrong value. Similarly, I think the values of havedead and reltuples can also be wrong.
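In other words, I think the shared state needs to be fully updated before workersFinished is incremented, roughly like this (the mutex and condition variable names are illustrative, not the patch's exact code):

    /* worker, at the end of _bt_parallel_scan_and_sort() */
    SpinLockAcquire(&btshared->mutex);
    btshared->reltuples += reltuples;
    btshared->havedead |= havedead;
    btshared->brokenhotchain |= brokenhotchain;
    btshared->workersFinished++;    /* only now signal completion */
    SpinLockRelease(&btshared->mutex);
    ConditionVariableSignal(&btshared->workersdonecv);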
You neither seem to have fixed nor responded to the second problem mentioned in my email upthread [1]. To reiterate, the problem is that we can't assume that the workers we have launched will always start and finish. It is possible that the postmaster fails to start a worker due to fork failure. In such conditions, tuplesort_leader_wait will hang indefinitely, because it will wait for the workersFinished count to become equal to the number of launched workers (+1, if the leader participates), which will never happen. Am I missing something due to which this won't be a problem?

Now, I think one argument is that such a problem can happen in a parallel query, so it is not the responsibility of this patch to solve it. However, we already have a patch to solve it (there are some review comments that need to be addressed in that patch), and this patch is adding a new code path with similar symptoms that can't be fixed by the already-proposed patch.

[1] - https://www.postgresql.org/message-id/CAA4eK1%2BizMyxzFD6k81Deyar35YJ5qdpbRTUp9cQvo%2BniQom7Q%40mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

On Wed, Jan 17, 2018 at 5:47 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> I could still reproduce it. I think the way you have fixed it has a
> race condition. In _bt_parallel_scan_and_sort(), the value of
> brokenhotchain is set after you signal the leader that the worker is
> done (by incrementing workersFinished). Now, the leader is free to
> make a decision based on the current shared state, which can give the
> wrong value. Similarly, I think the values of havedead and reltuples
> can also be wrong.
>
> You neither seem to have fixed nor responded to the second problem
> mentioned in my email upthread [1]. To reiterate, the problem is that
> we can't assume that the workers we have launched will always start
> and finish. It is possible that the postmaster fails to start a worker
> due to fork failure. In such conditions, tuplesort_leader_wait will
> hang indefinitely, because it will wait for the workersFinished count
> to become equal to the number of launched workers (+1, if the leader
> participates), which will never happen. Am I missing something due to
> which this won't be a problem?

I think that both problems (the live _bt_parallel_scan_and_sort() bug, as well as the general issue with needing to account for parallel worker fork() failure) are likely solvable by not using tuplesort_leader_wait(), and instead calling WaitForParallelWorkersToFinish(). Which you suggested already.

Separately, I will need to monitor that bugfix patch, and check its progress, to make sure that what I add is comparable to what ultimately gets committed for parallel query.

--
Peter Geoghegan
On Wed, Jan 17, 2018 at 12:27 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> I think that both problems (the live _bt_parallel_scan_and_sort() bug,
> as well as the general issue with needing to account for parallel
> worker fork() failure) are likely solvable by not using
> tuplesort_leader_wait(), and instead calling
> WaitForParallelWorkersToFinish(). Which you suggested already.

I'm wondering if this shouldn't instead be handled by using the new Barrier facilities. I think it would work like this:

- leader calls BarrierInit(..., 0)
- leader calls BarrierAttach() before starting workers.
- each worker, before reading anything from the parallel scan, calls BarrierAttach(). If the phase returned is greater than 0, then the worker arrived at the barrier after all the work was done, and should exit immediately.
- each worker, after finishing sorting, calls BarrierArriveAndWait(). The leader, after sorting, also calls BarrierArriveAndWait().
- when BarrierArriveAndWait() returns in the leader, all workers that actually started (and did so quickly enough) have arrived at the barrier. The leader can now do leader_takeover_tapes, being careful to adopt only the tapes actually created, since some workers may have failed to launch or launched only after sorting was already complete.
- meanwhile, the workers again call BarrierArriveAndWait().
- after it's done taking over tapes, the leader calls BarrierDetach(), releasing the workers.
- the workers call BarrierDetach() and then exit -- or maybe they don't even really need to detach

So the barrier phase numbers would have the following meanings:

0 - sorting
1 - taking over tapes
2 - done
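Put into (untested) pseudo-code, with placeholder names for the shared state and the wait events, that flow might look like this:

    /* Leader, during setup; the Barrier lives in the DSM segment: */
    BarrierInit(&shared->build_barrier, 0);
    BarrierAttach(&shared->build_barrier);  /* leader participates too */
    LaunchParallelWorkers(pcxt);

    /* Each worker, on startup: */
    if (BarrierAttach(&shared->build_barrier) > 0)
    {
        /* attached after sorting finished; must not touch the scan */
        BarrierDetach(&shared->build_barrier);
        return;
    }
    /* ... participate in parallel scan, sort, produce a tape ... */
    BarrierArriveAndWait(&shared->build_barrier, WAIT_EVENT_PLACEHOLDER); /* ends phase 0 */
    BarrierArriveAndWait(&shared->build_barrier, WAIT_EVENT_PLACEHOLDER); /* waits out phase 1 */
    BarrierDetach(&shared->build_barrier);

    /* Leader, after doing its own share of scanning and sorting: */
    BarrierArriveAndWait(&shared->build_barrier, WAIT_EVENT_PLACEHOLDER); /* ends phase 0 */
    leader_takeover_tapes();    /* adopt only tapes actually created */
    BarrierDetach(&shared->build_barrier);  /* advances phase, releasing workers */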
This could be slightly more elegant if BarrierArriveAndWait() had an additional argument indicating the phase number for which the backend could wait, or maybe the number of phases for which it should wait. Then, the workers could avoid having to call BarrierArriveAndWait() twice in a row.

While I find the Barrier API slightly confusing -- and I suspect I'm not entirely alone -- I don't think that's a good excuse for reinventing the wheel. The problem of needing to wait for every process that does A (in this case, read tuples from the scan) to also do B (in this case, finish sorting those tuples) is a very general one that is deserving of a general solution. Unless somebody comes up with a better plan, Barrier seems to be the way to do that in PostgreSQL.

I don't think using WaitForParallelWorkersToFinish() is a good idea. That would require workers to hold onto their tuplesorts until after losing the ability to send messages to the leader, which doesn't sound like a very good plan. We don't want workers to detach from their error queues until the bitter end, lest errors go unreported.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

On Mon, Jan 15, 2018 at 7:54 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>> BufFileView() looks fairly pointless. It basically creates a copy of
>> the input and, in so doing, destroys the input, which is a lot like
>> returning the input parameter except that it uses more cycles. It
>> does do a few things.
>
> While it certainly did occur to me that that was kind of weird, and I
> struggled with it on my own for a little while, I ultimately agreed
> with Thomas that it added something to have ltsConcatWorkerTapes()
> call some buffile function in every iteration of its loop.
> (BufFileView() + BufFileViewAppend() are code that Thomas actually
> wrote, though I added the asserts and comments myself.)

Hmm, well, if Thomas contributed code to this patch, then he needs to be listed as an author. I went searching for an email on this thread (or any other) where he posted code for this, thinking that there might be some discussion explaining the motivation, but I didn't find any.

I'm still in favor of erasing this distinction.

> If you think about this in terms of the interface rather than the
> implementation, then it may make more sense. The encapsulation adds
> something which might pay off later, such as when extendBufFile()
> needs to work with a concatenated set of BufFiles. And even right now,
> I cannot simply reuse the BufFile without then losing the assert that
> is currently in BufFileViewAppend() (must not have associated shared
> fileset assert). So I'd end up asserting less (rather than more) there
> if BufFileView() was removed.

I would see the encapsulation as having some value if the original BufFile remained valid and the new view were also valid. Then the BufFileView operation is a bit like a copy-on-write filesystem snapshot: you have the original, which you can do stuff with, and you have a copy, which can be manipulated independently, but the copying is cheap. But here the BufFile gobbles up the original, so I don't see the point.

The Assert(target->fileset == NULL) that would be lost in BufFileViewAppend has no value anyway, AFAICS. There is also Assert(source->readOnly), given which the presence or absence of the fileset makes no difference. And if, as you say, extendBufFile were eventually made to work here, this Assert would presumably get removed anyway; I think we'd likely want the additional files to get associated with the shared file set rather than being locally temporary files.

> It wastes some cycles to not simply use the BufFile directly, but not
> terribly many in the grand scheme of things. This happens once per
> external sort operation.

I'm not at all concerned about the loss of cycles. I'm concerned about making the mechanism more complicated to understand and maintain for future readers of the code. When experienced hackers see code that doesn't seem to accomplish anything, they (or at least I) tend to assume that there must be a hidden reason for it to be there and spend time trying to figure out what it is. If there actually is no hidden purpose, then that study is a waste of time and we can spare them the trouble by getting rid of it now.

>> In miscadmin.h, I'd put the prototype for the new GUC next to
>> max_worker_processes, not maintenance_work_mem.
>
> But then I'd really have to put it next to max_worker_processes in
> globals.c, too. That would mean that it would go under "Primary
> determinants of sizes of shared-memory structures" within globals.c,
> which seems wrong to me. What do you think?

OK, that's a fair point.
> I think I still get the gist of what you're saying, though. I've come up with a new structure that is a noticeable improvement on what I had. Importantly, the new structure let me add a number of parallelism-agnostic asserts that make sure that every ambuild routine that supports parallelism gets the details right.

Yes, that looks better. I'm slightly dubious that the new Asserts() are worthwhile, but I guess it's OK. But I think it would be better to ditch the if-statement and do it like this:

Assert(snapshot == SnapshotAny || IsMVCCSnapshot(snapshot));
Assert(snapshot == SnapshotAny ? TransactionIdIsValid(OldestXmin) :
       !TransactionIdIsValid(OldestXmin));
Assert(snapshot == SnapshotAny || !anyvisible);

Also, I think you've got a little more than you need in terms of comments. I would keep the comments for the serial case and parallel case and drop the earlier one that basically says the same thing:

+ * (Note that parallel case never has us register/unregister snapshot, and
+ * provides appropriate snapshot for us.)

> There is a little bit of ambiguity about other cases, though -- that's the secondary point I tried to make within that comment block, and the part that you took issue with. To put this secondary point another way: It's possible that we'd fail to detect it if someone's comparator went bananas and decided it was okay to do SQL access (that resulted in an index scan of the index undergoing reindex). That does seem rather unlikely, but I felt it necessary to say something like this because ReindexIsProcessingIndex() isn't already something that only deals with catalog indexes -- it works with all indexes.

I agree that it isn't particularly likely, but if somebody found it worthwhile to insert guards against those cases, maybe we should preserve them instead of abandoning them. It shouldn't be that hard to propagate those values from the leader to the workers. The main difficulty there seems to be that we're creating the parallel context in nbtsort.c, while the state that would need to be propagated is private to index.c, but there are several ways to solve that problem. It looks to me like the most robust approach would be to just make that part of what parallel.c naturally does. Patch for that attached.

> Here's what I've done based on your feedback: I've changed the header comments, but stopped leaderParticipates from affecting the compute_parallel_worker() calculation (so, as I said, leaderParticipates stays). The leaderParticipates argument continues to affect these two aspects of plan_create_index_workers()'s return value:
>
> 1. It continues to be used so we have a total number of participants (not workers) to apply our must-have-32MB-workMem limit on participants.
>
> Parallel query has no equivalent of this, and it seems warranted. Note that this limit is no longer applied when the parallel_workers storage param was set, as discussed.
>
> 2. I continue to use the leaderParticipates argument to disallow the case where there is only one CREATE INDEX participant but parallelism is in use, because, of course, that clearly makes no sense -- we should just use a serial sort instead.

That's an improvement, but see below.

> (It might make sense to allow this if parallel_leader_participation was *purely* a testing GUC, only for use by backend hackers, but AFAICT it isn't.)

As applied to parallel CREATE INDEX, it pretty much is just a testing GUC, which is why I was skeptical about leaving support for it in the patch.
There's no anticipated advantage to having the leader not participate -- unlike for parallel queries, where it is quite possible that setting parallel_leader_participation=off could be a win, even generally. If you just have a Gather over a parallel sequential scan, it is unlikely that parallel_leader_participation=off will help; it will most likely hurt, at least up to the point where more participants become a bad idea in general due to contention.

However, if you have a complex plan involving fairly-large operations that cannot be divided up among workers, such as a Parallel Append or a Hash Join with a big startup cost or a Sort that happens in the worker or even a parallel Index Scan that takes a long time to advance to the next page because it has to do I/O, you might leave workers idling while the leader is trying to "help". Some users may have workloads where this is the normal case. Ideally, the planner would figure out whether this is likely and tell the leader whether or not to participate, but we don't know how to figure that out yet. On the other hand, for CREATE INDEX, having the leader not participate can't really improve anything.

In other words, right now, parallel_leader_participation is not strictly a testing GUC, but if we make CREATE INDEX respect it, then we're pushing it towards being a GUC that you don't ever want to enable except for testing. I'm still not sure that's a very good idea, but if we're going to do it, then surely we should be consistent. It's true that having one worker and no parallel leader participation can never be better than just having the leader do it, but it is also true that having two leaders and no parallel leader participation can never be better than having 1 worker with leader participation. I don't see a reason to treat those cases differently.

If we're going to keep parallel_leader_participation support here, I think the last hunk in config.sgml should read more like this:

    Allows the leader process to execute the query plan under
    <literal>Gather</literal> and <literal>Gather Merge</literal> nodes
    and to participate in parallel index builds.  The default is
    <literal>on</literal>.  For queries, setting this value to
    <literal>off</literal> reduces the likelihood that workers will
    become blocked because the leader is not reading tuples fast enough,
    but requires the leader process to wait for worker processes to
    start up before the first tuples can be produced.  The degree to
    which the leader can help or hinder performance depends on the plan
    type or index build strategy, number of workers and query duration.
    For index builds, setting this value to <literal>off</literal> is
    expected to reduce performance, but may be useful for testing
    purposes.

> I suspect that the parameters of any cost model for parallel CREATE INDEX that we're prepared to consider for v11 are: "Use a number of parallel workers that is one below the number at which the total duration of the CREATE INDEX either stays the same or goes up".

That's pretty much the definition of a correct cost model; the trick is how to implement it without an oracle.

> BTW, the 32MB per participant limit within plan_create_index_workers() was chosen based on the realization that any higher value would make having a default setting of 2 for max_parallel_maintenance_workers (to match the max_parallel_workers_per_gather default) pointless when the default maintenance_work_mem value of 64MB is in use.
> That's not terribly scientific, though it at least doesn't come at the expense of a more scientific idea for a limit like that (I don't actually have one, you see). I am merely trying to avoid being *gratuitously* wasteful of shared resources that are difficult to accurately cost in (e.g., the distributed cost of random I/O to the system as a whole when we do a parallel index build while ridiculously low on maintenance_work_mem).

I see. I think it's a good start.

I wonder in general whether it's better to add memory or add workers. In other words, suppose I have a busy system where my index builds are slow. Should I try to free up some memory so that I can raise maintenance_work_mem, or should I try to free up some CPU resources so I can raise max_parallel_maintenance_workers? The answer doubtless depends on the current values that I have configured for those settings and the type of data that I'm indexing, as well as how much memory I could free up how easily and how much CPU I could free up how easily. But I wish I understood better than I do which one was more likely to help in a given situation.

I also wonder what the next steps would be to make this whole thing scale better. From the performance tests that have been performed so far, it seems like adding a modest number of workers definitely helps, but it tops out around 2-3x with 4-8 workers. I understand from your previous comments that's typical of other databases. It also seems pretty clear that more memory helps but only to a point. For instance, I just tried "create index x on pgbench_accounts (aid)" without your patch at scale factor 1000. With maintenance_work_mem = 1MB, it generated 6689 runs and took 131 seconds. With maintenance_work_mem = 64MB, it took 67 seconds. With maintenance_work_mem = 1GB, it took 60 seconds. More memory didn't help, even if the sort could be made entirely internal. This seems to be a fairly typical pattern: using enough memory can buy you a small multiple, using a bunch of workers can buy you a small multiple, but then it just doesn't get faster. Yet, in theory, it seems like if we're willing to provide essentially unlimited memory and CPU resources, we ought to be able to make this go almost arbitrarily fast.

>> I think that the naming of the wait events could be improved. Right now, they are named by which kind of process does the waiting, but they really should be named based on the thing for which we're waiting. I also suggest that we could just write Sort instead of Tuplesort. In short, I suggest ParallelTuplesortLeader -> ParallelSortWorkersDone and ParallelTuplesortWorker -> ParallelSortTapeHandover.
>
> WFM. Also added documentation for the wait events to monitoring.sgml, which I somehow missed the first time around.

But you forgot to update the preceding "morerows" line, so the formatting will be all messed up.

>> + * Make sure that the temp file(s) underlying the tape set are created in
>> + * suitable temp tablespaces.  This is only really needed for serial
>> + * sorts.
>>
>> This comment makes me wonder whether it is "sorta" needed for parallel sorts.
>
> I removed "really". The point of the comment is that we've already set up temp tablespaces for the shared fileset in the parallel case. Shared filesets figure out which tablespaces will be used up-front -- see SharedFileSetInit().

So why not say it that way? i.e. For parallel sorts, this should have been done already, but it doesn't matter if it gets done twice.
> I updated tuplesort_end() so that trace_sort reports on the end of the sort, even for worker processes. (We still don't show generic tuplesort_begin* message for workers, though.)

I don't see any reason not to make those contingent only on trace_sort. The user can puzzle apart which messages are which from the PIDs in the logfile.

>> Instead of making a special case in CreateParallelContext for serializable_okay, maybe index_build should just use SetConfigOption() to force the isolation level to READ COMMITTED right after it does NewGUCNestLevel(). The change would only be temporary because the subsequent call to AtEOXact_GUC() will revert it.
>
> I tried doing it that way, but it doesn't seem workable:
>
> postgres=# begin transaction isolation level serializable ;
> BEGIN
> postgres=*# reindex index test_unique;
> ERROR:  25001: SET TRANSACTION ISOLATION LEVEL must be called before any query
> LOCATION:  call_string_check_hook, guc.c:9953

Bummer.

> Instead, I added a bit more to comments within CreateParallelContext(), to justify what I've done along the lines you went into. Hopefully this works better for you.

Yeah, that seems OK.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
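For the archives, the rejected approach amounts to something like the fragment below. This is a sketch rather than code from the patch, and the GucContext/GucSource arguments are my guesses; the ERROR in Peter's session comes from the transaction_isolation check hook, which rejects any assignment once the transaction has already executed a query:

```c
int     save_nestlevel = NewGUCNestLevel();

/*
 * Force READ COMMITTED for the duration of the index build.  This is
 * exactly what check_transaction_isolation refuses to allow once the
 * transaction has run a query, hence the "must be called before any
 * query" ERROR above.
 */
SetConfigOption("transaction_isolation", "read committed",
                PGC_USERSET, PGC_S_SESSION);

/* ... perform the (parallel) index build ... */

/* Reverts the temporary setting, as the suggestion anticipated. */
AtEOXact_GUC(true, save_nestlevel);
```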
On Wed, Jan 17, 2018 at 10:27 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Jan 17, 2018 at 12:27 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>> I think that both problems (the live _bt_parallel_scan_and_sort() bug, as well as the general issue with needing to account for parallel worker fork() failure) are likely solvable by not using tuplesort_leader_wait(), and instead calling WaitForParallelWorkersToFinish(). Which you suggested already.
>
> I'm wondering if this shouldn't instead be handled by using the new Barrier facilities.
>
> While I find the Barrier API slightly confusing -- and I suspect I'm not entirely alone -- I don't think that's a good excuse for reinventing the wheel. The problem of needing to wait for every process that does A (in this case, read tuples from the scan) to also do B (in this case, finish sorting those tuples) is a very general one that is deserving of a general solution. Unless somebody comes up with a better plan, Barrier seems to be the way to do that in PostgreSQL.
>
> I don't think using WaitForParallelWorkersToFinish() is a good idea. That would require workers to hold onto their tuplesorts until after losing the ability to send messages to the leader, which doesn't sound like a very good plan. We don't want workers to detach from their error queues until the bitter end, lest errors go unreported.

What you say here sounds convincing to me. I actually brought up the idea of using the barrier abstraction a little over a month ago. I was discouraged by a complicated-sounding issue raised by Thomas [1]. At the time, I figured that the barrier abstraction was a nice-to-have, but not really essential. That idea doesn't hold up under scrutiny. I need to be able to use barriers.

There seems to be some yak shaving involved in getting the barrier abstraction to do exactly what is required, as Thomas went into at the time. How should that prerequisite work be structured? For example, should a patch be spun off for that part?

I may not be the most qualified person for this job, since Thomas considered two alternative approaches (to making the static barrier abstraction forget about never-launched participants) without ever settling on one of them.

[1] https://postgr.es/m/CAEepm=03YnefpCeB=Z67HtQAOEMuhKGyPCY_S1TeH=9a2Rr0LQ@mail.gmail.com
--
Peter Geoghegan
On Wed, Jan 17, 2018 at 7:00 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> There seems to be some yak shaving involved in getting the barrier abstraction to do exactly what is required, as Thomas went into at the time. How should that prerequisite work be structured? For example, should a patch be spun off for that part?
>
> I may not be the most qualified person for this job, since Thomas considered two alternative approaches (to making the static barrier abstraction forget about never-launched participants) without ever settling on one of them.

I had forgotten about the previous discussion. The sketch in my previous email supposed that we would use dynamic barriers since the whole point, after all, is to handle the fact that we don't know how many participants will really show up. Thomas's idea seems to be that the leader will initialize the barrier based on the anticipated number of participants and then tell it to forget about the participants that don't materialize. Of course, that would require that the leader somehow figure out how many participants didn't show up so that it can deduct them from the counter in the barrier. And how is it going to do that?

It's true that the leader will know the value of nworkers_launched, but as the comment in LaunchParallelWorkers() says: "The caller must be able to tolerate ending up with fewer workers than expected, so there is no need to throw an error here if registration fails. It wouldn't help much anyway, because registering the worker in no way guarantees that it will start up and initialize successfully."

So it seems to me that a much better plan than having the leader try to figure out how many workers failed to launch would be to just keep a count of how many workers did in fact launch. The count can be stored in shared memory, and each worker that comes along can increment it. Then we don't have to worry about whether we accurately detect failure to launch. We can argue about whether it's possible to detect all cases of failure to launch unerringly, but what's for sure is that if a worker increments a counter in shared memory, it launched. Now, where should this counter be located? There are of course multiple possibilities, but in my sketch it goes in some_barrier_variable->nparticipants, i.e., we just use a dynamic barrier.

So my position (at least until Thomas or Andres shows up and tells me why I'm wrong) is that you can use the Barrier API just as it is without any yak-shaving, just by following the sketch I set out before. The additional API I proposed in that sketch isn't really required, although it might be more efficient. But it doesn't really matter: if that comes along later, it will be trivial to adjust the code to take advantage of it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
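The nice property of that scheme is that it is trivially race-free. A hand-rolled version might look like the sketch below (struct and field names invented); with a dynamic barrier, BarrierAttach() does the equivalent bookkeeping in the barrier's participant count, which is the point of the sketch upthread:

```c
#include "postgres.h"
#include "port/atomics.h"

typedef struct SortShared           /* hypothetical state in the DSM segment */
{
    pg_atomic_uint32    nlaunched;  /* workers that demonstrably started */
    /* ... */
} SortShared;

/* Leader, while initializing shared memory: */
pg_atomic_init_u32(&shared->nlaunched, 0);

/* Each worker, first thing after attaching to the DSM segment: */
pg_atomic_fetch_add_u32(&shared->nlaunched, 1);

/*
 * No guessing about registration or fork() failures is needed: if the
 * counter was bumped, the worker launched.
 */
```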
On Wed, Jan 17, 2018 at 10:40 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> While it certainly did occur to me that that was kind of weird, and I struggled with it on my own for a little while, I ultimately agreed with Thomas that it added something to have ltsConcatWorkerTapes() call some buffile function in every iteration of its loop. (BufFileView() + BufFileViewAppend() are code that Thomas actually wrote, though I added the asserts and comments myself.)
>
> Hmm, well, if Thomas contributed code to this patch, then he needs to be listed as an author. I went searching for an email on this thread (or any other) where he posted code for this, thinking that there might be some discussion explaining the motivation, but I didn't find any. I'm still in favor of erasing this distinction.

I cleared this with Thomas recently, on this very thread, and got a +1 from him on not listing him as an author. Still, I have no problem crediting Thomas as an author instead of a reviewer, even though you're now asking me to remove what little code he actually authored. The distinction between secondary author and reviewer is often blurred, anyway. Whether or not Thomas is formally a co-author is ambiguous, and not something that I feel strongly about (there is no ambiguity about the fact that he made a very useful contribution, though -- he certainly did, both directly and indirectly).

I already went out of my way to ensure that Heikki receives a credit for parallel CREATE INDEX in the v11 release notes, even though I don't think that there is any formal rule requiring me to do so -- he *didn't* write even one line of code in this patch. (That was just my take on another ambiguous question about authorship.)

I suggest that we revisit this when you're just about to commit the patch. Or you can just add his name -- I like to err on the side of being inclusive.

>> If you think about this in terms of the interface rather than the implementation, then it may make more sense. The encapsulation adds something which might pay off later, such as when extendBufFile() needs to work with a concatenated set of BufFiles. And even right now, I cannot simply reuse the BufFile without then losing the assert that is currently in BufFileViewAppend() (must not have associated shared fileset assert). So I'd end up asserting less (rather than more) there if BufFileView() was removed.
>
> I would see the encapsulation as having some value if the original BufFile remained valid and the new view were also valid. Then the BufFileView operation is a bit like a copy-on-write filesystem snapshot: you have the original, which you can do stuff with, and you have a copy, which can be manipulated independently, but the copying is cheap. But here the BufFile gobbles up the original so I don't see the point.

I'll see what I can do about this.

>> I think I still get the gist of what you're saying, though. I've come up with a new structure that is a noticeable improvement on what I had. Importantly, the new structure let me add a number of parallelism-agnostic asserts that make sure that every ambuild routine that supports parallelism gets the details right.
>
> Yes, that looks better. I'm slightly dubious that the new Asserts() are worthwhile, but I guess it's OK.

Bear in mind that the asserts basically amount to a check that the AM propagated indexInfo->ii_Concurrent correctly within workers.
It's nice to be able to do this in a way that applies equally well to the serial case.

> But I think it would be better to ditch the if-statement and do it like this:
>
> Assert(snapshot == SnapshotAny || IsMVCCSnapshot(snapshot));
> Assert(snapshot == SnapshotAny ? TransactionIdIsValid(OldestXmin) :
>        !TransactionIdIsValid(OldestXmin));
> Assert(snapshot == SnapshotAny || !anyvisible);
>
> Also, I think you've got a little more than you need in terms of comments. I would keep the comments for the serial case and parallel case and drop the earlier one that basically says the same thing:

Okay.

>> (ReindexIsProcessingIndex() issue with non-catalog tables)
>
> I agree that it isn't particularly likely, but if somebody found it worthwhile to insert guards against those cases, maybe we should preserve them instead of abandoning them. It shouldn't be that hard to propagate those values from the leader to the workers. The main difficulty there seems to be that we're creating the parallel context in nbtsort.c, while the state that would need to be propagated is private to index.c, but there are several ways to solve that problem. It looks to me like the most robust approach would be to just make that part of what parallel.c naturally does. Patch for that attached.

If you think it's worth the cycles, then I have no objection. I will point out that this means that everything that I say about ReindexIsProcessingIndex() no longer applies, because the relevant state will now be propagated. It doesn't need to be mentioned at all, and I don't even need to forbid builds on catalogs.

Should I go ahead and restore builds on catalogs, and remove those comments, on the assumption that your patch will be committed before mine? Obviously parallel index builds on catalogs don't matter. OTOH, why not? Perhaps it's like the debate around HOT that took place over 10 years ago, where Tom insisted that HOT work with catalogs on general principle.

>> (It might make sense to allow this if parallel_leader_participation was *purely* a testing GUC, only for use by backend hackers, but AFAICT it isn't.)
>
> As applied to parallel CREATE INDEX, it pretty much is just a testing GUC, which is why I was skeptical about leaving support for it in the patch. There's no anticipated advantage to having the leader not participate -- unlike for parallel queries, where it is quite possible that setting parallel_leader_participation=off could be a win, even generally. If you just have a Gather over a parallel sequential scan, it is unlikely that parallel_leader_participation=off will help; it will most likely hurt, at least up to the point where more participants become a bad idea in general due to contention.

It's unlikely to hurt much, since as you yourself said, compute_parallel_worker() doesn't consider the leader's participation. Actually, if we assume that compute_parallel_worker() is perfect, then surely parallel_leader_participation=off would beat parallel_leader_participation=on for CREATE INDEX -- it would allow us to use the value that compute_parallel_worker() truly intended. Which is the opposite of what you say about parallel_leader_participation=off above.

I am only trying to understand your perspective here. I don't think that parallel_leader_participation support is that important.
I think that parallel_leader_participation=off might be slightly useful as a way of discouraging parallel CREATE INDEX on smaller tables, just like it is for parallel sequential scan (though this hinges on specifically disallowing "degenerate parallel scan" cases). More often, it will make hardly any difference if parallel_leader_participation is on or off.

> In other words, right now, parallel_leader_participation is not strictly a testing GUC, but if we make CREATE INDEX respect it, then we're pushing it towards being a GUC that you don't ever want to enable except for testing. I'm still not sure that's a very good idea, but if we're going to do it, then surely we should be consistent.

I'm confused. I *don't* want it to be something that you can only use for testing. I want to not hurt whatever case there is for the parallel_leader_participation GUC being something that a DBA may tune in production. I don't see the conflict here.

> It's true that having one worker and no parallel leader participation can never be better than just having the leader do it, but it is also true that having two leaders and no parallel leader participation can never be better than having 1 worker with leader participation. I don't see a reason to treat those cases differently.

You must mean "having two workers and no parallel leader participation...".

The reason to treat those two cases differently is simple: One couldn't possibly be desirable in production, and undermines the whole idea of parallel_leader_participation being user visible by adding a sharp edge. The other is likely to be pretty harmless, especially because leader participation is generally pretty fudged, and our cost model is fairly rough. The difference here isn't what is important; avoiding doing something that we know couldn't possibly help under any circumstances is important. I think that we should do that on general principle.

As I said in a prior e-mail, even parallel query's use of parallel_leader_participation is consistent with what I propose here, practically speaking, because a partial path without leader participation will always lose to a serial sequential scan path in practice. The fact that the optimizer will create a partial path that makes a useless "degenerate parallel scan" a *theoretical* possibility is irrelevant, because the optimizer has its own way of making sure that such a plan doesn't actually get picked. It has its way, and so I must have my own.

> If we're going to keep parallel_leader_participation support here, I think the last hunk in config.sgml should read more like this:
>
> Allows the leader process to execute the query plan under <literal>Gather</literal> and <literal>Gather Merge</literal> nodes and to participate in parallel index builds. The default is <literal>on</literal>. For queries, setting this value to <literal>off</literal> reduces the likelihood that workers will become blocked because the leader is not reading tuples fast enough, but requires the leader process to wait for worker processes to start up before the first tuples can be produced. The degree to which the leader can help or hinder performance depends on the plan type or index build strategy, number of workers and query duration. For index builds, setting this value to <literal>off</literal> is expected to reduce performance, but may be useful for testing purposes.

Why is CREATE INDEX really that different in terms of the downside for production DBAs?
I think it's more accurate to say that it's not expected to improve performance. What do you think? >> I suspect that the parameters of any cost model for parallel CREATE >> INDEX that we're prepared to consider for v11 are: "Use a number of >> parallel workers that is one below the number at which the total >> duration of the CREATE INDEX either stays the same or goes up". > > That's pretty much the definition of a correct cost model; the trick > is how to implement it without an oracle. Correct on its own terms, at least. What I meant to convey here is that there is little scope to do better in v11 on distributed costs for the system as a whole, and therefore little scope to improve the cost model. >> BTW, the 32MB per participant limit within plan_create_index_workers() >> was chosen based on the realization that any higher value would make >> having a default setting of 2 for max_parallel_maintenance_workers (to >> match the max_parallel_workers_per_gather default) pointless when the >> default maintenance_work_mem value of 64MB is in use. > I see. I think it's a good start. I wonder in general whether it's > better to add memory or add workers. In other words, suppose I have a > busy system where my index builds are slow. Should I try to free up > some memory so that I can raise maintenance_work_mem, or should I try > to free up some CPU resources so I can raise > max_parallel_maintenance_workers? This is actually all about distributed costs, I think. Provided you have a reasonably sympathetic index build, like say a random numeric column index build, and the index won't be multiple gigabytes in size, then 1MB of maintenance_work_mem still seems to win with parallelism. This seems extremely "selfish", though -- that's going to incur a lot of random I/O for an operation that is typically characterized by sequential I/O. Plus, I bet you're using quite a bit more memory than 1MB, in the form of FS cache. It seems hard to lose if you don't care about distributed costs, especially if it's a matter of using 1 or 2 parallel workers versus just doing a serial build. Granted, you go into a 1MB of maintenance_work_mem case below where parallelism loses, which seems to contradict my suggestion that you practically cannot lose with parallelism. However, ISTM that you really went out of your way to find a case that lost. Of course, I'm not arguing that it's okay for parallel CREATE INDEX to be selfish -- it isn't. I'm prepared to say that you shouldn't use parallelism if you have 1MB of maintenance_work_mem, no matter how much it seems to help (and though it might sound crazy, because it is, it *can* help). I'm just surprised that you've not said a lot more about distributed costs, because that's where all the potential benefit seems to be. It happens to be an area that we have no history of modelling in any way, which makes it hard, but that's the situation we seem to be in. > I also wonder what the next steps would be to make this whole thing > scale better. From the performance tests that have been performed so > far, it seems like adding a modest number of workers definitely helps, > but it tops out around 2-3x with 4-8 workers. I understand from your > previous comments that's typical of other databases. Yes. This patch seems to have scalability that is very similar to the scalability that you get with similar features in other systems. I have not verified this through first hand experience/experiments, because I don't have access to that stuff. 
But I have found numerous reports related to more than one other system. I don't think that this is the only goal that matters, but I do think that it's an interesting data point. > It also seems > pretty clear that more memory helps but only to a point. For > instance, I just tried "create index x on pgbench_accounts (aid)" > without your patch at scale factor 1000. Again, I discourage everyone from giving too much weight to index builds like this one. This does not involve very much sorting at all, because everything is already in order, and the comparisons are cheap int4 comparisons. It may already be very I/O bound before you start to use parallelism. > With maintenance_work_mem = > 1MB, it generated 6689 runs and took 131 seconds. With > maintenance_work_mem = 64MB, it took 67 seconds. With > maintenance_work_mem = 1GB, it took 60 seconds. More memory didn't > help, even if the sort could be made entirely internal. This seems to > be a fairly typical pattern: using enough memory can buy you a small > multiple, using a bunch of workers can buy you a small multiple, but > then it just doesn't get faster. Adding memory is just as likely to hurt slightly as help slightly, especially if you're talking about CREATE INDEX, where being able to use a final on-the-fly merge is a big deal (you can hide the cost of the merging by doing it when you're very I/O bound anyway). This should be true with only modest assumptions: I assume that you're in one pass territory here, and that you have a reasonably small merge heap (with perhaps no more than 100 runs). This seems likely to be true the vast majority of the time with CREATE INDEX, assuming the system is reasonably well configured. Roughly speaking, once my assumptions are met, the exact number of runs almost doesn't matter (that's at least useful as a mental model). I basically disagree with the statement "using enough memory can buy you a small multiple", since it's only true when you started out using an unreasonably small amount of memory. Bear in mind that every time maintenance_work_mem is doubled, our capacity to do sorts in one pass quadruples. Using 1MB of maintenance_work_mem just doesn't make sense *economically*, unless, perhaps, you care about neither the duration of the CREATE INDEX statement, nor your electricity bill. You cannot extrapolate anything useful from an index build that uses only 1MB of maintenance_work_mem for all kinds of reasons. I suggest taking another look at Prabhat's results. Here are my observations about them: * For serial sorts, a person reading his results could be forgiven for thinking that increasing the amount of memory for a sort makes it go *slower*, at least by a bit. * Sometimes that doesn't happen for serial sorts, and sometimes it does happen for parallel sorts, but mostly it hurts serial sorts and helps parallel sorts, since Prabhat didn't start with an unreasonable low amount of maintenance_work_mem. * All the indexes are built against the same table. For the serial cases, among each index that was built, the longest build took about 6x more time than the shortest. For parallel builds, it's more like a 3x difference. The difference gets smaller when you eliminate cases that actually have to do almost no sorting. This "3x vs. 6x" difference matters a lot. This suggests to me that parallel CREATE INDEX has proven itself as something that can take a mostly CPU bound index build, and make it into a mostly I/O bound index build. 
It also suggests that we can make better use of memory with parallel CREATE INDEX only because workers will still need to get a reasonable amount of memory. You definitely don't want multiple passes in workers, but for the same reasons that you don't want them in serial cases. > Yet, in theory, it seems like if > we're willing to provide essentially unlimited memory and CPU > resources, we ought to be able to make this go almost arbitrarily > fast. The main reason that the scalability of CREATE INDEX has trouble getting past about 3.5x in cases we've seen doesn't involve any scalability theory: we're very much I/O bound during the merge, because we have to actually write out the index, regardless of what tuplesort does or doesn't do. I've seen over 4x improvements on systems that have sufficient temp file sequential I/O bandwidth, and reasonably sympathetic data distributions/types. >> WFM. Also added documentation for the wait events to monitoring.sgml, >> which I somehow missed the first time around. > > But you forgot to update the preceding "morerows" line, so the > formatting will be all messed up. Fixed. >> I removed "really". The point of the comment is that we've already set >> up temp tablespaces for the shared fileset in the parallel case. >> Shared filesets figure out which tablespaces will be used up-front -- >> see SharedFileSetInit(). > > So why not say it that way? i.e. For parallel sorts, this should have > been done already, but it doesn't matter if it gets done twice. Okay. > I don't see any reason not to make those contingent only on > trace_sort. The user can puzzle apart which messages are which from > the PIDs in the logfile. Okay. I have removed anything that restrains the verbosity of trace_sort for the WORKER() case. I think that you were right about it the first time, but I now think that this is going too far. I'm letting it go, though. -- Peter Geoghegan
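To put arithmetic behind the claim upthread that doubling maintenance_work_mem quadruples one-pass capacity, here is the back-of-the-envelope model (with workMem $M$, quicksorted initial runs come out at roughly $M$ bytes each, and a single merge pass can keep roughly $M/c$ runs' worth of tape buffers in memory at once, for some per-tape buffer cost $c$; the constant doesn't matter, only the shape):

$$ \text{one-pass capacity} \;\approx\; \underbrace{M}_{\text{bytes per run}} \times \underbrace{M/c}_{\text{runs mergeable at once}} \;=\; \frac{M^{2}}{c} $$

Going from $M$ to $2M$ therefore yields $(2M)^{2}/c = 4M^{2}/c$, which is why even a modest maintenance_work_mem keeps almost every CREATE INDEX sort in one-pass territory.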
On Wed, Jan 17, 2018 at 6:20 PM, Robert Haas <robertmhaas@gmail.com> wrote: > I had forgotten about the previous discussion. The sketch in my > previous email supposed that we would use dynamic barriers since the > whole point, after all, is to handle the fact that we don't know how > many participants will really show up. Thomas's idea seems to be that > the leader will initialize the barrier based on the anticipated number > of participants and then tell it to forget about the participants that > don't materialize. Of course, that would require that the leader > somehow figure out how many participants didn't show up so that it can > deduct then from the counter in the barrier. And how is it going to > do that? I don't know; Thomas? > It's true that the leader will know the value of nworkers_launched, > but as the comment in LaunchParallelWorkers() says: "The caller must > be able to tolerate ending up with fewer workers than expected, so > there is no need to throw an error here if registration fails. It > wouldn't help much anyway, because registering the worker in no way > guarantees that it will start up and initialize successfully." So it > seems to me that a much better plan than having the leader try to > figure out how many workers failed to launch would be to just keep a > count of how many workers did in fact launch. > So my position (at least until Thomas or Andres shows up and tells me > why I'm wrong) is that you can use the Barrier API just as it is > without any yak-shaving, just by following the sketch I set out > before. The additional API I proposed in that sketch isn't really > required, although it might be more efficient. But it doesn't really > matter: if that comes along later, it will be trivial to adjust the > code to take advantage of it. Okay. I'll work on adopting dynamic barriers in the way you described. I just wanted to make sure that we're all on the same page about what that looks like. -- Peter Geoghegan
Hi, I'm mostly away from my computer this week -- sorry about that, but here are a couple of quick answers to questions directed at me: On Thu, Jan 18, 2018 at 4:22 PM, Peter Geoghegan <pg@bowt.ie> wrote: > On Wed, Jan 17, 2018 at 10:40 AM, Robert Haas <robertmhaas@gmail.com> wrote: >>> While it certainly did occur to me that that was kind of weird, and I >>> struggled with it on my own for a little while, I ultimately agreed >>> with Thomas that it added something to have ltsConcatWorkerTapes() >>> call some buffile function in every iteration of its loop. >>> (BufFileView() + BufFileViewAppend() are code that Thomas actually >>> wrote, though I added the asserts and comments myself.) >> >> Hmm, well, if Thomas contributed code to this patch, then he needs to >> be listed as an author. I went searching for an email on this thread >> (or any other) where he posted code for this, thinking that there >> might be some discussion explaining the motivation, but I didn't find >> any. I'm still in favor of erasing this distinction. > > I cleared this with Thomas recently, on this very thread, and got a +1 > from him on not listing him as an author. Still, I have no problem > crediting Thomas as an author instead of a reviewer, even though > you're now asking me to remove what little code he actually authored. > The distinction between secondary author and reviewer is often > blurred, anyway. The confusion comes about because I gave some small code fragments to Rushabh for the BufFileView stuff off-list, when suggesting ideas for how to integrate Peter's patch with some ancestor of my SharedFileSet patch. It was just a sketch and whether or not any traces remain in the final commit, please credit me as a reviewer. I need to review more patches! /me ducks No objections from me if you hate the "view" idea or implementation and think it's better to make a destructive append-BufFile-to-BufFile operation instead. On Thu, Jan 18, 2018 at 4:28 PM, Peter Geoghegan <pg@bowt.ie> wrote: > On Wed, Jan 17, 2018 at 6:20 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> I had forgotten about the previous discussion. The sketch in my >> previous email supposed that we would use dynamic barriers since the >> whole point, after all, is to handle the fact that we don't know how >> many participants will really show up. Thomas's idea seems to be that >> the leader will initialize the barrier based on the anticipated number >> of participants and then tell it to forget about the participants that >> don't materialize. Of course, that would require that the leader >> somehow figure out how many participants didn't show up so that it can >> deduct then from the counter in the barrier. And how is it going to >> do that? > > I don't know; Thomas? The idea I mentioned would only work if nworkers_launched is never over-reported in a scenario that doesn't error out or crash, and never under-reported in any scenario. Otherwise static barriers may be even less useful than I thought. >> It's true that the leader will know the value of nworkers_launched, >> but as the comment in LaunchParallelWorkers() says: "The caller must >> be able to tolerate ending up with fewer workers than expected, so >> there is no need to throw an error here if registration fails. It >> wouldn't help much anyway, because registering the worker in no way >> guarantees that it will start up and initialize successfully." 
>> So it seems to me that a much better plan than having the leader try to figure out how many workers failed to launch would be to just keep a count of how many workers did in fact launch.

(If nworkers_launched can be silently over-reported, then does parallel_leader_participation = off have a bug? If no workers really launched and reached the main executor loop but nworkers_launched > 0, then no one is running the plan.)

>> So my position (at least until Thomas or Andres shows up and tells me why I'm wrong) is that you can use the Barrier API just as it is without any yak-shaving, just by following the sketch I set out before. The additional API I proposed in that sketch isn't really required, although it might be more efficient. But it doesn't really matter: if that comes along later, it will be trivial to adjust the code to take advantage of it.

Yeah, the dynamic Barrier API was intended for things like this. I was only trying to provide a simpler-to-use alternative that I thought might work for this particular case (but not executor nodes, which have another source of uncertainty about party size). It sounds like it's not actually workable though, and the dynamic API may be the only way. So the patch would have to deal with explicit phases.

> Okay. I'll work on adopting dynamic barriers in the way you described. I just wanted to make sure that we're all on the same page about what that looks like.

Looking at Robert's sketch, a few thoughts:

(1) it's not OK to attach and then just exit, you'll need to detach from the barrier both in the case where the worker exits early because the phase is too high and the case where you attach in time to help and run to completion;

(2) maybe workers could use BarrierArriveAndDetach() at the end (the leader needs to use BarrierArriveAndWait(), but the workers don't really need to wait for each other before they exit, do they?);

(3) erm, maybe it's a problem that errors occurring in workers while the leader is waiting at a barrier won't unblock the leader (we don't detach from barriers on abort/exit) -- I'll look into this.

--
Thomas Munro
http://www.enterprisedb.com
On Thu, Jan 18, 2018 at 4:19 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> Hi,
>
> I'm mostly away from my computer this week -- sorry about that, but here are a couple of quick answers to questions directed at me:
>
> On Thu, Jan 18, 2018 at 4:22 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>> On Wed, Jan 17, 2018 at 10:40 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
>>> It's true that the leader will know the value of nworkers_launched, but as the comment in LaunchParallelWorkers() says: "The caller must be able to tolerate ending up with fewer workers than expected, so there is no need to throw an error here if registration fails. It wouldn't help much anyway, because registering the worker in no way guarantees that it will start up and initialize successfully." So it seems to me that a much better plan than having the leader try to figure out how many workers failed to launch would be to just keep a count of how many workers did in fact launch.
>
> (If nworkers_launched can be silently over-reported, then does parallel_leader_participation = off have a bug?

Yes, and it is being discussed in CF entry [1].

[1] - https://commitfest.postgresql.org/16/1341/

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Thu, Jan 18, 2018 at 8:52 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Wed, Jan 17, 2018 at 10:40 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
>>> (It might make sense to allow this if parallel_leader_participation was *purely* a testing GUC, only for use by backend hackers, but AFAICT it isn't.)
>>
>> As applied to parallel CREATE INDEX, it pretty much is just a testing GUC, which is why I was skeptical about leaving support for it in the patch. There's no anticipated advantage to having the leader not participate -- unlike for parallel queries, where it is quite possible that setting parallel_leader_participation=off could be a win, even generally. If you just have a Gather over a parallel sequential scan, it is unlikely that parallel_leader_participation=off will help; it will most likely hurt, at least up to the point where more participants become a bad idea in general due to contention.
>
> It's unlikely to hurt much, since as you yourself said, compute_parallel_worker() doesn't consider the leader's participation. Actually, if we assume that compute_parallel_worker() is perfect, then surely parallel_leader_participation=off would beat parallel_leader_participation=on for CREATE INDEX -- it would allow us to use the value that compute_parallel_worker() truly intended. Which is the opposite of what you say about parallel_leader_participation=off above.
>
> I am only trying to understand your perspective here. I don't think that parallel_leader_participation support is that important. I think that parallel_leader_participation=off might be slightly useful as a way of discouraging parallel CREATE INDEX on smaller tables, just like it is for parallel sequential scan (though this hinges on specifically disallowing "degenerate parallel scan" cases). More often, it will make hardly any difference if parallel_leader_participation is on or off.
>
>> In other words, right now, parallel_leader_participation is not strictly a testing GUC, but if we make CREATE INDEX respect it, then we're pushing it towards being a GUC that you don't ever want to enable except for testing. I'm still not sure that's a very good idea, but if we're going to do it, then surely we should be consistent.

I see your point. OTOH, I think we should have something for testing purposes, as that helps in catching bugs and makes it easy to write tests that cover the worker part of the code.

> I'm confused. I *don't* want it to be something that you can only use for testing. I want to not hurt whatever case there is for the parallel_leader_participation GUC being something that a DBA may tune in production. I don't see the conflict here.
>
>> It's true that having one worker and no parallel leader participation can never be better than just having the leader do it, but it is also true that having two leaders and no parallel leader participation can never be better than having 1 worker with leader participation. I don't see a reason to treat those cases differently.
>
> You must mean "having two workers and no parallel leader participation...".
>
> The reason to treat those two cases differently is simple: One couldn't possibly be desirable in production, and undermines the whole idea of parallel_leader_participation being user visible by adding a sharp edge. The other is likely to be pretty harmless, especially because leader participation is generally pretty fudged, and our cost model is fairly rough. The difference here isn't what is important; avoiding doing something that we know couldn't possibly help under any circumstances is important. I think that we should do that on general principle.
>
> As I said in a prior e-mail, even parallel query's use of parallel_leader_participation is consistent with what I propose here, practically speaking, because a partial path without leader participation will always lose to a serial sequential scan path in practice. The fact that the optimizer will create a partial path that makes a useless "degenerate parallel scan" a *theoretical* possibility is irrelevant, because the optimizer has its own way of making sure that such a plan doesn't actually get picked. It has its way, and so I must have my own.

Can you please elaborate what part of the optimizer you are talking about, where a partial path without leader participation will always lose to a serial sequential scan path?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Wed, Jan 17, 2018 at 10:22 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> As I said in a prior e-mail, even parallel query's use of parallel_leader_participation is consistent with what I propose here, practically speaking, because a partial path without leader participation will always lose to a serial sequential scan path in practice.

Amit's reply to this part drew my attention to it. I think this is entirely false. Consider an aggregate that doesn't support partial aggregation, and a plan that looks like this:

Aggregate
-> Gather
   -> Parallel Seq Scan
        Filter: something fairly selective

It is quite possible for this to be superior to a non-parallel plan even with only 1 worker and no parallel leader participation. The worker can evaluate the filter qual, and the leader can evaluate the aggregate. If the CPU costs of doing those computations are high enough to outweigh the costs of shuffling tuples between backends, we win.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Jan 18, 2018 at 5:49 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > I'm mostly away from my computer this week -- sorry about that, Yeah, seriously. Since when it is OK for hackers to ever be away from their computers? :-) > The idea I mentioned would only work if nworkers_launched is never > over-reported in a scenario that doesn't error out or crash, and never > under-reported in any scenario. Otherwise static barriers may be even > less useful than I thought. I just went back to the thread on "parallel.c oblivion of worker-startup failures" and refreshed my memory about what's going on over there. What's going on over there is (1) currently, nworkers_launched can be over-reported in a scenario that doesn't error out or crash and (2) I'm proposing to tighten things up so that this is no longer the case. Amit proposed making it the responsibility of code that uses parallel.c to cope with nworkers_launched being larger than the number that actually launched, and my counter-proposal was to make it reliably ERROR when they don't all launch. So, thinking about this, I think that my proposal to use dynamic barriers here seems like it will work regardless of who wins that argument. Your proposal to use static barriers and decrement the party size based on the number of participants which fail to start will work if I win that argument, but will not work if Amit wins that argument. It seems to me in general that dynamic barriers are to be preferred in almost every circumstance, because static barriers require a longer chain of assumptions. We can't assume that the number of guests we invite to the party will match the number that actually show up, so, in the case of a static barrier, we have to make sure to adjust the party size if some of the guests end up having to stay home with a sick kid or their car breaks down or if they decide to go to the neighbor's party instead. Absentee guests are not intrinsically a problem, but we have to make sure that we account for them in a completely water-tight fashion. On the other hand, with a dynamic barrier, we don't need to worry about the guests that don't show up; we only need to worry about the guests that DO show up. As they come in the door, we count them; as they leave, we count them again. When the numbers are equal, the party's over. That seems more robust. In particular, for parallel query, there is absolutely zero guarantee that every worker reaches every plan node. For a parallel utility command, things seem a little better: we can assume that workers are started only for one particular purpose. But even that might not be true in the future. For example, imagine a parallel CREATE INDEX on a partitioned table that cascades to all children. One can easily imagine wanting to use the same workers for the whole operation and spread them out across the pool of tasks much as Parallel Append does. There's a good chance this will be faster than doing each index build in turn with maximum parallelism. And then the static barrier thing goes right out the window again, because the number of participants is determined dynamically. I really struggle to think of any case where a static barrier is better. I mean, suppose we have an existing party and then decide to hold a baking contest. We'll use a barrier to separate the baking phase from the judging phase. One might think that, since the number of participants is already decided, someone could initialize the barrier with that number rather than making everyone attach. 
But it doesn't really work, because there's a race: while one process is creating the barrier with participants = 10, the doctor's beeper goes off and he leaves the party. Now there could be some situation in which we are absolutely certain that we know how many participants we've got and it won't change, but I suspect that in almost every scenario deciding to use a static barrier is going to be immediately followed by a lot of angst about how we can be sure that the number of participants will always be correct.

>> Okay. I'll work on adopting dynamic barriers in the way you described. I just wanted to make sure that we're all on the same page about what that looks like.
>
> Looking at Robert's sketch, a few thoughts: (1) it's not OK to attach and then just exit, you'll need to detach from the barrier both in the case where the worker exits early because the phase is too high and the case where you attach in time to help and run to completion;

In the first case, I guess this is because otherwise the other participants will wait for us even though we're not really there any more. In the second case, I'm not sure why it matters whether we detach. If we've reached the highest possible phase number, nobody's going to wait any more, so who cares? (I mean, apart from tidiness.)

> (2) maybe workers could use BarrierArriveAndDetach() at the end (the leader needs to use BarrierArriveAndWait(), but the workers don't really need to wait for each other before they exit, do they?);

They don't need to wait for each other, but they do need to wait for the leader, so I don't think this works. Logically, there are two key sequencing points. First, the leader needs to wait for the workers to finish sorting. That's the barrier between phase 0 and phase 1. Second, the workers need to wait for the leader to absorb their tapes. That's the barrier between phase 1 and phase 2. If the workers use BarrierArriveAndWait to reach phase 1 and then BarrierArriveAndDetach, they won't wait for the leader to be done adopting their tapes as they do in the current patch.

But, hang on a minute. Why do the workers need to wait for the leader anyway? Can't they just exit once they're done sorting? I think the original reason why this ended up in the patch is that we needed the leader to assume ownership of the tapes to avoid having the tapes get blown away when the worker exits. But, IIUC, with sharedfileset.c, that problem no longer exists. The tapes are jointly owned by all of the cooperating backends and the last one to detach from it will remove them. So, if the worker sorts, advertises that it's done in shared memory, and exits, then nothing should get removed and the leader can adopt the tapes whenever it gets around to it.

If that's correct, then we only need 2 phases, not 3. Workers BarrierAttach() before reading any data, exiting if the phase is not 0. Otherwise, they then read data and sort it, then advertise the final tape in shared memory, then BarrierArriveAndDetach(). The leader does BarrierAttach() before launching any workers, then reads data and sorts it if applicable, then does BarrierArriveAndWait(). When that returns, all workers are done sorting (and may or may not have finished exiting) and the leader can take over their tapes and everything is fine. That's significantly simpler than my previous outline, and also simpler than what the patch does today.
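Spelled out, the two-phase flow is short enough to sketch in full. As before, the names are invented (phase 0 = sorting, phase 1 = done), advertise_tape_in_shared_memory() is a placeholder for whatever the patch does to publish a worker's final tape, and 0 stands in for a real wait-event value:

```c
/* Leader (sketch): */
BarrierInit(&shared->barrier, 0);
BarrierAttach(&shared->barrier);
LaunchParallelWorkers(pcxt);
/* ... leader reads and sorts its own share, if participating ... */
BarrierArriveAndWait(&shared->barrier, 0);
/*
 * All sorting is done.  The shared fileset keeps worker tapes alive
 * even if the workers have since exited, so the leader can adopt
 * whichever tapes were advertised and begin the merge.
 */

/* Worker (sketch): */
if (BarrierAttach(&shared->barrier) != 0)
{
    BarrierDetach(&shared->barrier);        /* showed up too late to help */
    return;
}
/* ... read from the parallel scan, sort, produce this worker's tape ... */
advertise_tape_in_shared_memory(shared);    /* hypothetical */
BarrierArriveAndDetach(&shared->barrier);   /* no need to wait on anyone */
```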
> (3) > erm, maybe it's a problem that errors occurring in workers while the > leader is waiting at a barrier won't unblock the leader (we don't > detach from barriers on abort/exit) -- I'll look into this. I think if there's an ERROR, the general parallelism machinery is going to arrange to kill every worker, so nothing matters in that case unless barrier waits ignore interrupts, which I'm pretty sure they don't. (Also: if they do, I'll hit the ceiling; that would be awful.) -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Jan 18, 2018 at 6:21 AM, Robert Haas <robertmhaas@gmail.com> wrote: > Amit's reply to this part drew my attention to it. I think this is > entirely false. Consider an aggregate that doesn't support partial > aggregation, and a plan that looks like this: > > Aggregate > -> Gather > -> Parallel Seq Scan > Filter: something fairly selective > > It is quite possible for this to be superior to a non-parallel plan > even with only 1 worker and no parallel leader participation. The > worker can evaluate the filter qual, and the leader can evaluate the > aggregate. If the CPU costs of doing those computations are high > enough to outweigh the costs of shuffling tuples between backends, we > win. That seems pretty far fetched. But even if it wasn't, my position would not change. This could happen only because the planner determined that it was the cheapest plan when parallel_leader_participation happened to be off. But clearly a "degenerate parallel CREATE INDEX" will never be faster than a serial CREATE INDEX, and there is a simple way to always avoid one. So why not do so? I give up. I'll go ahead and make parallel_leader_participation=off allow a degenerate parallel CREATE INDEX in the next version. I think that it will make parallel_leader_participation less useful, with no upside, but there doesn't seem to be much more that I can do about that. -- Peter Geoghegan
On Wed, Jan 17, 2018 at 10:22 PM, Peter Geoghegan <pg@bowt.ie> wrote: > If you think it's worth the cycles, then I have no objection. I will > point out that this means that everything that I say about > ReindexIsProcessingIndex() no longer applies, because the relevant > state will now be propagated. It doesn't need to be mentioned at all, > and I don't even need to forbid builds on catalogs. > > Should I go ahead and restore builds on catalogs, and remove those > comments, on the assumption that your patch will be committed before > mine? Obviously parallel index builds on catalogs don't matter. OTOH, > why not? Perhaps it's like the debate around HOT that took place over > 10 years ago, where Tom insisted that HOT work with catalogs on > general principle. Yes, I think so. If you (or someone else) can review that patch, I'll go ahead and commit it, and then your patch can treat it as a solved problem. I'm not really worried about the cycles; the amount of effort required here is surely very small compared to all of the other things that have to be done when starting a parallel worker. I'm not as dogmatic as Tom is about the idea that everything must support system catalogs or it's not worth doing, but I do think it's better if it can be done that way with reasonable effort. When each new feature comes with a set of unsupported corner cases, it becomes hard for users to understand what will and will not actually work. Now, really big features like parallel query or partitioning or logical replication generally do need to exclude some things in v1 or you can never finish the project, but in this case plugging the gap seems quite feasible. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Jan 18, 2018 at 1:14 PM, Peter Geoghegan <pg@bowt.ie> wrote: > That seems pretty far fetched. I don't think it is, and there are plenty of other examples. All you need is a query plan that involves significant CPU work both below the Gather node and above the Gather node. It's not difficult to find plans like that; there are TPC-H queries that generate plans like that. > But even if it wasn't, my position > would not change. This could happen only because the planner > determined that it was the cheapest plan when > parallel_leader_participation happened to be off. But clearly a > "degenerate parallel CREATE INDEX" will never be faster than a serial > CREATE INDEX, and there is a simple way to always avoid one. So why > not do so? That's an excellent argument for making parallel CREATE INDEX ignore parallel_leader_participation entirely. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Jan 18, 2018 at 6:14 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > I see your point. OTOH, I think we should have something for testing > purpose as that helps in catching the bugs and makes it easy to write > tests that cover worker part of the code. This is about the question of whether or not we want to allow parallel_leader_participation to prevent or allow a parallel CREATE INDEX that has 1 parallel worker that does all the sorting, with the leader simply consuming its output without doing any merging (a "degenerate parallel CREATE INDEX"). It is perhaps only secondarily about the question of ripping out parallel_leader_participation entirely. > Can you please elaborate what part of optimizer are you talking about > where without leader participation partial path will always lose to a > serial sequential scan path? See my remarks to Robert just now. -- Peter Geoghegan
On Thu, Jan 18, 2018 at 10:27 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Jan 18, 2018 at 1:14 PM, Peter Geoghegan <pg@bowt.ie> wrote: >> That seems pretty far fetched. > > I don't think it is, and there are plenty of other examples. All you > need is a query plan that involves significant CPU work both below the > Gather node and above the Gather node. It's not difficult to find > plans like that; there are TPC-H queries that generate plans like > that. You need a very selective qual in the worker that eliminates most input (keeping the worker busy), and yet the plan must still manage to keep the leader busy rather than waiting on input from the Gather. >> But even if it wasn't, my position >> would not change. This could happen only because the planner >> determined that it was the cheapest plan when >> parallel_leader_participation happened to be off. But clearly a >> "degenerate parallel CREATE INDEX" will never be faster than a serial >> CREATE INDEX, and there is a simple way to always avoid one. So why >> not do so? > > That's an excellent argument for making parallel CREATE INDEX ignore > parallel_leader_participation entirely. I'm done making arguments about parallel_leader_participation. Tell me what you want, and I'll do it. -- Peter Geoghegan
On Thu, Jan 18, 2018 at 10:17 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> Should I go ahead and restore builds on catalogs, and remove those >> comments, on the assumption that your patch will be committed before >> mine? Obviously parallel index builds on catalogs don't matter. OTOH, >> why not? Perhaps it's like the debate around HOT that took place over >> 10 years ago, where Tom insisted that HOT work with catalogs on >> general principle. > > Yes, I think so. If you (or someone else) can review that patch, I'll > go ahead and commit it, and then your patch can treat it as a solved > problem. I'm not really worried about the cycles; the amount of > effort required here is surely very small compared to all of the other > things that have to be done when starting a parallel worker. Review of your patch: * SerializedReindexState could use some comments. At least a one liner stating its basic purpose. * The "System index reindexing support" comment block could do with a passing acknowledgement of the fact that this is serialized for parallel workers. * Maybe the "Serialize reindex state" comment within InitializeParallelDSM() should instead say something like "Serialize indexes-pending-reindex state". Other than that, looks good to me. It's a simple patch with a clear purpose. -- Peter Geoghegan
On Thu, Jan 18, 2018 at 9:22 AM, Robert Haas <robertmhaas@gmail.com> wrote: > I just went back to the thread on "parallel.c oblivion of > worker-startup failures" and refreshed my memory about what's going on > over there. What's going on over there is (1) currently, > nworkers_launched can be over-reported in a scenario that doesn't > error out or crash and (2) I'm proposing to tighten things up so that > this is no longer the case. I think that we need to be able to rely on nworkers_launched to not over-report the number of workers launched. To be fair to Amit, I haven't actually gone off and studied the problem myself, so it's not fair to dismiss his point of view. It nevertheless seems to me that it makes life an awful lot easier to be able to rely on nworkers_launched. > So, thinking about this, I think that my proposal to use dynamic > barriers here seems like it will work regardless of who wins that > argument. Your proposal to use static barriers and decrement the > party size based on the number of participants which fail to start > will work if I win that argument, but will not work if Amit wins that > argument. Sorry, but I've changed my mind. I don't think barriers owned by tuplesort.c will work for us (though I think we will still need a synchronization primitive within nbtsort.c). The design that Robert sketched for using barriers seemed fine at first. But then I realized: what about the case where you have 2 spools? I now understand why Thomas thought that I'd end up using static barriers, because I now see that dynamic barriers have problems of their own if used by tuplesort.c, even with the trick of only having participants actually participate on the condition that they show up before the party is over (before there are no tuples left for the worker to consume). The idea of the leader using nworkers_launched as the assumed-launched number of workers is pretty much baked into my patch, because my patch makes tuplesorts composable (e.g. nbtsort.c uses two tuplesorts when there is a unique index build/2 spools). Do individual workers need to be prepared to back out of the main spool's sort, but not the spool2 sort (for unique index builds), or vice-versa? Clearly that's untenable, because they're going to need to have both as long as they're participating in a parallel CREATE INDEX (of a unique index) -- IndexBuildHeapScan() expects both at the same time, but there is a race condition when launching workers with 2 spools. So does nbtsort.c need to own the barrier instead? If it does, and if that barrier subsumes the responsibilities of tuplesort.c's condition variables, then I don't see how that can avoid causing a mess due to confusion about phases across tuplesorts/spools. nbtsort.c *will* need some synchronization primitive, actually (I'm thinking of a condition variable), but only because of the fact that nbtsort.c happens to want to aggregate statistics about the sort at the end (for pg_index) -- this doesn't seem like tuplesort's problem at all. In general, it's very natural to just call tuplesort_leader_wait(), and have all the relevant details encapsulated within tuplesort.c. We could make tuplesort_leader_wait() totally optional, and just use the barrier within nbtsort.c for the wait (more on that later). > In particular, for parallel query, there is absolutely zero guarantee > that every worker reaches every plan node. For a parallel utility > command, things seem a little better: we can assume that workers are > started only for one particular purpose.
But even that might not be > true in the future. I expect workers that are reported launched to show up eventually, or report failure. They don't strictly have to do any work beyond just showing up (finding no tuples, reaching tuplesort_performsort(), then finally reaching tuplesort_end()). The spool2 issue I describe above shows why this is. They own the state (tuplesort tuples) that they consume, and may possibly have 2 or more tuplesorts. If they cannot do the bare minimum of checking in with us, then we're in big trouble, because that's indistinguishable from their having actually sorted some tuples that the leader ultimately needs to consume, without our knowing it. It wouldn't be impossible to use barriers for everything. That just seems to be incompatible with tuplesorts being composable. Long ago, nbtsort.c actually did the sorting, too. If that were still true, then it would be rather a lot more like parallel hashjoin, I think. You could then just have one barrier for one state machine (with one or two spools). It seems clear that we should avoid teaching tuplesort.c about nbtsort.c. > But, hang on a minute. Why do the workers need to wait for the leader > anyway? Can't they just exit once they're done sorting? I think the > original reason why this ended up in the patch is that we needed the > leader to assume ownership of the tapes to avoid having the tapes get > blown away when the worker exits. But, IIUC, with sharedfileset.c, > that problem no longer exists. You're right. This is why we could make calling tuplesort_leader_wait() optional. We only need one condition variable in tuplesort.c. Which makes me even less inclined to make the remaining workersFinishedCv condition variable into a barrier, since it's not at all barrier-like. After all, workers don't care about each other's progress, or where the leader is. The leader needs to wait until all known-launched participants report having finished, which seems like a typical reason to use a condition variable. That doesn't seem phase-like at all. As for workers, they don't have phases ("done" isn't a phase for them, because as you say, there is no need for them to wait until the leader says they can go with the shared fileset stuff -- that's the leader's problem alone.) I guess the fact that tuplesort_leader_wait() could be optional means that it could be removed, which means that we could in fact throw out the last condition variable within tuplesort.c, and fully rely on using a barrier for everything within nbtsort.c. However, tuplesort_leader_wait() seems kind of like something that we should have on general principle. And, more importantly, it would be tricky to use a barrier even for this, because we still have that baked-in assumption that nworkers_launched is the single source of truth about the number of participants. -- Peter Geoghegan
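Concretely, the hand-off being described comes out looking roughly like this. This is a sketch only: it reuses the workersFinishedCv/nparticipantsdone names from this discussion, but the struct layout is invented and error handling is elided:

#include "storage/condition_variable.h"
#include "storage/spin.h"

typedef struct SortShared
{
	slock_t		mutex;
	int			nparticipantsdone;	/* workers finished sorting */
	ConditionVariable workersFinishedCv;
} SortShared;

/* Worker, once its tuplesort_performsort() has returned: */
SpinLockAcquire(&shared->mutex);
shared->nparticipantsdone++;
SpinLockRelease(&shared->mutex);
ConditionVariableSignal(&shared->workersFinishedCv);

/* Leader, waiting for every launched worker to report in: */
for (;;)
{
	int			ndone;

	SpinLockAcquire(&shared->mutex);
	ndone = shared->nparticipantsdone;
	SpinLockRelease(&shared->mutex);

	if (ndone == pcxt->nworkers_launched)
		break;
	ConditionVariableSleep(&shared->workersFinishedCv, 0 /* wait event elided */);
}
ConditionVariableCancelSleep();

Note the baked-in assumption: the leader's loop terminates only if all nworkers_launched workers eventually increment the counter (or an ERROR propagates), which is exactly why the reliability of nworkers_launched matters so much here.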
On Thu, Jan 18, 2018 at 2:05 PM, Peter Geoghegan <pg@bowt.ie> wrote: > Review of your patch: > > * SerializedReindexState could use some comments. At least a one liner > stating its basic purpose. Added a comment. > * The "System index reindexing support" comment block could do with a > passing acknowledgement of the fact that this is serialized for > parallel workers. Done. > * Maybe the "Serialize reindex state" comment within > InitializeParallelDSM() should instead say something like "Serialize > indexes-pending-reindex state". That would require corresponding changes in a bunch of other places, possibly including the function names. I think it's better to keep the function names shorter and the comments matching the function names, so I did not make this change. > Other than that, looks good to me. It's a simple patch with a clear purpose. Committed. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Jan 19, 2018 at 4:52 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> Other than that, looks good to me. It's a simple patch with a clear purpose. > > Committed. Cool. Clarity on what I should do about parallel_leader_participation in the next revision would be useful at this point. You seem to either want me to remove it from consideration entirely, or to remove the code that specifically disallows a "degenerate parallel CREATE INDEX". I need a final answer on that. -- Peter Geoghegan
On Fri, Jan 19, 2018 at 12:16 PM, Peter Geoghegan <pg@bowt.ie> wrote: > On Fri, Jan 19, 2018 at 4:52 AM, Robert Haas <robertmhaas@gmail.com> wrote: >>> Other than that, looks good to me. It's a simple patch with a clear purpose. >> >> Committed. > > Cool. > > Clarity on what I should do about parallel_leader_participation in the > next revision would be useful at this point. You seem to either want > me to remove it from consideration entirely, or to remove the code > that specifically disallows a "degenerate parallel CREATE INDEX". I > need a final answer on that. Right. I do think that we should do one of those things, and I lean towards removing it entirely, but I'm not entirely sure. Rather than making an executive decision immediately, I'd like to wait a few days to give others a chance to comment. I am hoping that we might get some other opinions, especially from Thomas who implemented parallel_leader_participation, or maybe Amit who has been reviewing recently, or anyone else who is paying attention to this thread. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sat, Jan 20, 2018 at 6:32 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Jan 19, 2018 at 12:16 PM, Peter Geoghegan <pg@bowt.ie> wrote: >> Clarity on what I should do about parallel_leader_participation in the >> next revision would be useful at this point. You seem to either want >> me to remove it from consideration entirely, or to remove the code >> that specifically disallows a "degenerate parallel CREATE INDEX". I >> need a final answer on that. > > Right. I do think that we should do one of those things, and I lean > towards removing it entirely, but I'm not entirely sure. Rather > than making an executive decision immediately, I'd like to wait a few > days to give others a chance to comment. I am hoping that we might get > some other opinions, especially from Thomas who implemented > parallel_leader_participation, or maybe Amit who has been reviewing > recently, or anyone else who is paying attention to this thread. Well, I see parallel_leader_participation as having these reasons to exist: 1. Gather could in rare circumstances not run the plan in the leader. This can hide bugs. It's good to be able to force that behaviour for testing. 2. Plans that tie up the leader process for a long time cause the tuple queues to block, which reduces parallelism. I speculate that some people might want to turn that off in production, but at the very least it seems useful for certain kinds of performance testing to be able to remove this complication from the picture. 3. The planner's estimations of parallel leader contribution are somewhat bogus, especially if the startup cost is high. It's useful to be able to remove that problem from the picture sometimes, at least for testing and development work. Parallel CREATE INDEX doesn't have any of those problems. The only reason I can see for it to respect parallel_leader_participation = off is for consistency with Gather. If someone decides to run their cluster with that setting, then it's slightly odd if CREATE INDEX scans and sorts with one extra process, but it doesn't seem like a big deal. I vote for removing the GUC from consideration for now (ie always use the leader), and revisiting the question again later when we have more experience or if the parallel degree logic becomes more sophisticated in future. -- Thomas Munro http://www.enterprisedb.com
On Thu, Jan 18, 2018 at 5:53 PM, Peter Geoghegan <pg@bowt.ie> wrote: > I guess the fact that tuplesort_leader_wait() could be optional means > that it could be removed, which means that we could in fact throw out > the last condition variable within tuplesort.c, and fully rely on > using a barrier for everything within nbtsort.c. However, > tuplesort_leader_wait() seems kind of like something that we should > have on general principle. And, more importantly, it would be tricky > to use a barrier even for this, because we still have that baked-in > assumption that nworkers_launched is the single source of truth about > the number of participants. On third thought, tuplesort_leader_wait() should be removed entirely, and tuplesort.c should get entirely out of the IPC business (it should do the bare minimum of recording/reading a little state in shared memory, while knowing nothing about condition variables, barriers, or anything declared in parallel.h). Thinking about dealing with 2 spools at once clinched it for me -- calling tuplesort_leader_wait() for both underlying Tuplesortstates was silly, especially because there is still a need for an nbtsort.c-specific wait for workers to fill in ambuild stats. When I said "tuplesort_leader_wait() seems kind of like something that we should have on general principle", I was wrong. It's normal for parallel workers to have all kinds of overlapping responsibilities, and tuplesort_leader_wait() was doing something that I now imagine isn't desirable to most callers. They can easily provide something equivalent at a higher level. Besides, they'll very likely be forced to anyway, due to some high level, caller-specific need -- which is exactly what we see within nbtsort.c. Attached patch details: * The patch synchronizes processes using the approach just described. Note that this allowed me to remove several #include statements within tuplesort.c. * The patch uses only a single condition variable for a single wait within nbtsort.c, for the leader. No barriers are used at all (and, as I said, tuplesort.c doesn't use condition variables anymore). Since things are now very simple, I can't imagine anyone still arguing for the use of barriers. Note that I'm still relying on nworkers_launched as the single source of truth on the number of participants that can be expected to eventually show up (even if they end up doing zero real work). This should be fine, because I think that it will end up being formally guaranteed to be reliable by the work currently underway from Robert and Amit. But even if I'm wrong about things going that way, and it turns out that the leader needs to decide how many putative launched workers don't "get lost" due to fork() failure (the approach which Amit apparently advocates), then there *still* isn't much that needs to change. Ultimately, the leader needs to have the exact number of workers that participated, because that's fundamental to the tuplesort approach to parallel sort. If necessary, the leader can just figure it out in whatever way it likes at one central point within nbtsort.c, before the leader calls its main spool's tuplesort_begin_index_btree() -- that can happen fairly late in the process. Actually doing that (and not just using nworkers_launched) seems questionable to me, because it would be almost the first thing that the leader would do after starting parallel mode -- why not just have the parallel infrastructure do it for us, and for everyone else?
If the new tuplesort infrastructure is used in the executor at some future date, then the leader will still need to figure out the number of workers that reached tuplesort_begin* some other way. This shouldn't be surprising to anyone -- tuplesort.h is very clear on this point. * I revised the tuplesort.h contract to account for the developments already described (mostly that I've removed tuplesort_leader_wait()). * The patch makes the IPC wait event CREATE INDEX specific, since tuplesort no longer does any waits of its own -- it's now called ParallelCreateIndexScan. The patch also removes the second wait event entirely (the one that we called ParallelSortTapeHandover). * We now support index builds on catalogs. I rebased on top of Robert's recent "REINDEX state in parallel workers" commit, 29d58fd3. Note that there was a bug here in error paths that caused Robert's "can't happen" error to be raised (the PG_CATCH() block call to ResetReindexProcessing()). I fixed this in passing, by simply removing that one "can't happen" error. Note that ResetReindexProcessing() is only called directly within reindex_index()/IndexCheckExclusion(). This made the idea of preserving the check in a diminished form (#include'ing parallel.h within index.c, in order to check if we're a parallel worker as a condition of raising that "can't happen" error) seem unnecessary. * The patch does not alter anything about parallel_leader_participation, except the alterations that Robert requested to the docs (he requested these alterations on the assumption that we won't end up doing anything special with parallel_leader_participation). I am waiting for a final decision on what is to be done about parallel_leader_participation, but for now I've changed nothing. * I removed BufFileView(). I also renamed BufFileViewAppend() to BufFileAppend(). * I performed some other minor tweaks, including some requested by Robert in his most recent round of review. Thanks -- Peter Geoghegan
On Sat, Jan 20, 2018 at 2:57 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > On Sat, Jan 20, 2018 at 6:32 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Fri, Jan 19, 2018 at 12:16 PM, Peter Geoghegan <pg@bowt.ie> wrote: >>> Clarity on what I should do about parallel_leader_participation in the >>> next revision would be useful at this point. You seem to either want >>> me to remove it from consideration entirely, or to remove the code >>> that specifically disallows a "degenerate parallel CREATE INDEX". I >>> need a final answer on that. >> >> Right. I do think that we should do one of those things, and I lean >> towards removing it entirely, but I'm not entirely sure. Rather >> than making an executive decision immediately, I'd like to wait a few >> days to give others a chance to comment. I am hoping that we might get >> some other opinions, especially from Thomas who implemented >> parallel_leader_participation, or maybe Amit who has been reviewing >> recently, or anyone else who is paying attention to this thread. > > Well, I see parallel_leader_participation as having these reasons to exist: > > 1. Gather could in rare circumstances not run the plan in the leader. > This can hide bugs. It's good to be able to force that behaviour for > testing. > Or the reverse is also possible, which means the workers won't get a chance to run the plan, in which case we can use parallel_leader_participation = off to test worker behavior. As I said before, I see only that as the reason to keep parallel_leader_participation in this patch. If we decide to go that way, then I think we should remove the code that specifically disallows a "degenerate parallel CREATE INDEX", as that seems confusing. If we go this way, then I think we should use the wording suggested by Robert in one of his emails [1] to describe the usage of parallel_leader_participation. BTW, is there any other way for "parallel create index" to force that the work is done by workers? I am insisting on having something which can test the code path in workers because we have found quite a few bugs using that idea. [1] - https://www.postgresql.org/message-id/CA%2BTgmoYN-YQU9JsGQcqFLovZ-C%2BXgp1_xhJQad%3DcunGG-_p5gg%40mail.gmail.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Fri, Jan 19, 2018 at 6:52 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > Or the reverse is also possible, which means the workers won't get a > chance to run the plan, in which case we can use parallel_leader_participation > = off to test worker behavior. As I said before, I see only that as > the reason to keep parallel_leader_participation in this patch. If we > decide to go that way, then I think we should remove the code that > specifically disallows a "degenerate parallel CREATE INDEX", as that > seems confusing. If we go this way, then I think we should use > the wording suggested by Robert in one of his emails [1] to describe > the usage of parallel_leader_participation. I agree that parallel_leader_participation is only useful for testing in the context of parallel CREATE INDEX. My concern with allowing a "degenerate parallel CREATE INDEX" to go ahead is that parallel_leader_participation generally isn't just intended for testing by hackers (if it was, then I wouldn't care). But I'm now more than willing to let this go. > BTW, is there any other way for "parallel create index" to force that > the work is done by workers? I am insisting on having something which > can test the code path in workers because we have found quite a few > bugs using that idea. I agree that this is essential (more so than supporting parallel_leader_participation). You can use the parallel_workers table storage parameter for this. When the storage param has been set, we don't care about the amount of memory available to each worker. You can stress-test the implementation as needed. (The storage param does care about max_parallel_maintenance_workers, but you can set that as high as you like.) -- Peter Geoghegan
On Sat, Jan 20, 2018 at 8:33 AM, Peter Geoghegan <pg@bowt.ie> wrote: > On Fri, Jan 19, 2018 at 6:52 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > >> BTW, is there any other way for "parallel create index" to force that >> the work is done by workers? I am insisting on having something which >> can test the code path in workers because we have found quite a few >> bugs using that idea. > > I agree that this is essential (more so than supporting > parallel_leader_participation). You can use the parallel_workers table > storage parameter for this. When the storage param has been set, we > don't care about the amount of memory available to each worker. You > can stress-test the implementation as needed. (The storage param does > care about max_parallel_maintenance_workers, but you can set that as > high as you like.) > Right, but I think using parallel_leader_participation, you can do it reliably and probably write some regression tests which can complete in a predictable time. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Fri, Jan 19, 2018 at 8:44 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > Right, but I think using parallel_leader_participation, you can do it > reliably and probably write some regression tests which can complete > in a predictable time. Do what reliably? Guarantee that the leader will not participate as a worker, but that workers will be used? If so, yes, you can get that. The only issue is that you may not be able to launch parallel workers due to hitting a limit like max_parallel_workers, in which case you'll get a serial index build despite everything. Nothing we can do about that, though. -- Peter Geoghegan
On Sat, Jan 20, 2018 at 10:20 AM, Peter Geoghegan <pg@bowt.ie> wrote: > On Fri, Jan 19, 2018 at 8:44 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> Right, but I think using parallel_leader_participation, you can do it >> reliably and probably write some regression tests which can complete >> in a predictable time. > > Do what reliably? Guarantee that the leader will not participate as a > worker, but that workers will be used? If so, yes, you can get that. > Yes, that's what I mean. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Sat, Jan 20, 2018 at 7:03 AM, Peter Geoghegan <pg@bowt.ie> wrote: > On Thu, Jan 18, 2018 at 5:53 PM, Peter Geoghegan <pg@bowt.ie> wrote: > > Attached patch details: > > * The patch synchronizes processes using the approach just described. > Note that this allowed me to remove several #include statements within > tuplesort.c. > > * The patch uses only a single condition variable for a single wait > within nbtsort.c, for the leader. No barriers are used at all (and, as > I said, tuplesort.c doesn't use condition variables anymore). Since > things are now very simple, I can't imagine anyone still arguing for > the use of barriers. > > Note that I'm still relying on nworkers_launched as the single source > of truth on the number of participants that can be expected to > eventually show up (even if they end up doing zero real work). This > should be fine, because I think that it will end up being formally > guaranteed to be reliable by the work currently underway from Robert > and Amit. But even if I'm wrong about things going that way, and it > turns out that the leader needs to decide how many putative launched > workers don't "get lost" due to fork() failure (the approach which > Amit apparently advocates), then there *still* isn't much that needs > to change. > > Ultimately, the leader needs to have the exact number of workers that > participated, because that's fundamental to the tuplesort approach to > parallel sort. > I think I can see why this patch needs that. Is it mainly for the work you are doing in _bt_leader_heapscan where you are waiting for all the workers to be finished? > If necessary, the leader can just figure it out in > whatever way it likes at one central point within nbtsort.c, before > the leader calls its main spool's tuplesort_begin_index_btree() -- > that can happen fairly late in the process. Actually doing that (and > not just using nworkers_launched) seems questionable to me, because it > would be almost the first thing that the leader would do after > starting parallel mode -- why not just have the parallel > infrastructure do it for us, and for everyone else? > I think till now we haven't had any such requirement, but if it is a must for this patch, then I don't think it is tough to do that. We need to write an API WaitForParallelWorkerToAttach() and then call it for each launched worker, or maybe WaitForParallelWorkersToAttach(), which can wait for all workers to attach and report how many have successfully attached. It will have the functionality of WaitForBackgroundWorkerStartup, and additionally it needs to check if the worker is attached to the error queue. We already have a similar API (WaitForReplicationWorkerAttach) for logical replication workers as well. Note that it might have a slight impact on performance, because with this you need to wait for the workers to start up before doing any actual work, but I don't think it should be noticeable for large operations, especially operations like parallel create index. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
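As a strawman, the API being sketched might come out something like this; it is purely hypothetical, since no such function exists yet, and both the signature and the semantics are guesses:

/*
 * Hypothetical: block until every worker registered in pcxt has either
 * attached to its error queue or failed to start, and report how many
 * actually attached.  Modeled loosely on WaitForBackgroundWorkerStartup()
 * and WaitForReplicationWorkerAttach().
 */
extern int	WaitForParallelWorkersToAttach(ParallelContext *pcxt);

/* Hypothetical caller, e.g. somewhere in nbtsort.c: */
LaunchParallelWorkers(pcxt);
nparticipants = WaitForParallelWorkersToAttach(pcxt);
/* nparticipants can now safely be treated as the true party size,
 * unlike raw pcxt->nworkers_launched in the fork()-failure scenarios
 * discussed above */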
On Fri, Jan 19, 2018 at 9:29 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > I think I can see why this patch needs that. Is it mainly for the > work you are doing in _bt_leader_heapscan where you are waiting for > all the workers to be finished? Yes, though it's also needed for the leader tuplesort. It needs to be able to discover worker runs by looking for temp files named 0 through to $NWORKERS - 1. The problem with seeing who shows up after a period of time, and having the leader arbitrarily determine that to be the total number of participants (while blocking further participants from joining) is that I don't know *how long to wait*. This would probably work okay for parallel CREATE INDEX, when the leader participates as a worker, because you can check only when the leader is finished acting as a worker. It stands to reason that that's enough time for worker processes to at least show up, and be seen to show up. We can use the duration of the leader's participation as a worker as a natural way to decide how long to wait. But what about when the leader doesn't participate as a worker, for whatever reason? Other uses for parallel tuplesort might typically have much less leader participation as compared to parallel CREATE INDEX. In short, ISTM that seeing who shows up is a bad strategy for parallel tuplesort. > I think till now we haven't had any such requirement, but if it is a must > for this patch, then I don't think it is tough to do that. We need to > write an API WaitForParallelWorkerToAttach() and then call it for each > launched worker, or maybe WaitForParallelWorkersToAttach(), which can > wait for all workers to attach and report how many have successfully > attached. It will have the functionality of > WaitForBackgroundWorkerStartup, and additionally it needs to check if > the worker is attached to the error queue. We already have a similar > API (WaitForReplicationWorkerAttach) for logical replication workers > as well. Note that it might have a slight impact on performance, > because with this you need to wait for the workers to start up before > doing any actual work, but I don't think it should be noticeable for > large operations, especially operations like parallel create index. Actually, though it doesn't really look like it from the way things are structured within nbtsort.c, I don't need to wait for workers to start up (call the WaitForParallelWorkerToAttach() function you sketched) before doing any real work within the leader. The leader can participate as a worker, and only do this check afterwards. That will work because the leader Tuplesortstate has yet to do any real work. Nothing stops me from adding a new function to tuplesort, for the leader, that lets the leader say: "New plan -- you should now expect this many participants" (leader takes this reliable number from eventual call to WaitForParallelWorkerToAttach()). I admit that I had no idea that there is this issue with nworkers_launched until very recently. But then, that field has absolutely no comments. -- Peter Geoghegan
On Sun, Jan 21, 2018 at 1:39 AM, Peter Geoghegan <pg@bowt.ie> wrote: > On Fri, Jan 19, 2018 at 9:29 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > > Actually, though it doesn't really look like it from the way things > are structured within nbtsort.c, I don't need to wait for workers to > start up (call the WaitForParallelWorkerToAttach() function you > sketched) before doing any real work within the leader. The leader can > participate as a worker, and only do this check afterwards. That will > work because the leader Tuplesortstate has yet to do any real work. > Nothing stops me from adding a new function to tuplesort, for the > leader, that lets the leader say: "New plan -- you should now expect > this many participants" (leader takes this reliable number from > eventual call to WaitForParallelWorkerToAttach()). > > I admit that I had no idea that there is this issue with > nworkers_launched until very recently. But then, that field has > absolutely no comments. > It would have been better if there were some comments besides that field, but I think it has been covered at another place in the code. See comments in LaunchParallelWorkers(). /* * Start workers. * * The caller must be able to tolerate ending up with fewer workers than * expected, so there is no need to throw an error here if registration * fails. It wouldn't help much anyway, because registering the worker in * no way guarantees that it will start up and initialize successfully. */ -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Sat, Jan 20, 2018 at 8:38 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > It would have been better if there were some comments besides that > field, but I think it has been covered at another place in the code. > See comments in LaunchParallelWorkers(). > > /* > * Start workers. > * > * The caller must be able to tolerate ending up with fewer workers than > * expected, so there is no need to throw an error here if registration > * fails. It wouldn't help much anyway, because registering the worker in > * no way guarantees that it will start up and initialize successfully. > */ Why is this okay for Gather nodes, though? nodeGather.c looks at pcxt->nworkers_launched during initialization, and appears to at least trust it to indicate that more than zero actually-launched workers will also show up when "nworkers_launched > 0". This trust seems critical when parallel_leader_participation is off, because "node->nreaders == 0" overrides the parallel_leader_participation GUC's setting (note that node->nreaders comes directly from pcxt->nworkers_launched). If zero workers show up, and parallel_leader_participation is off, but pcxt->nworkers_launched/node->nreaders is non-zero, won't the Gather never make forward progress? Parallel CREATE INDEX does go a bit further. It assumes that nworkers_launched *exactly* indicates the number of workers that successfully underwent parallel initialization, and therefore can be expected to show up. Is there actually a meaningful difference between the way nworkers_launched is depended upon in each case, though? -- Peter Geoghegan
On Mon, Jan 22, 2018 at 12:50 AM, Peter Geoghegan <pg@bowt.ie> wrote: > On Sat, Jan 20, 2018 at 8:38 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> It would have been better if there were some comments besides that >> field, but I think it has been covered at another place in the code. >> See comments in LaunchParallelWorkers(). >> >> /* >> * Start workers. >> * >> * The caller must be able to tolerate ending up with fewer workers than >> * expected, so there is no need to throw an error here if registration >> * fails. It wouldn't help much anyway, because registering the worker in >> * no way guarantees that it will start up and initialize successfully. >> */ > > Why is this okay for Gather nodes, though? nodeGather.c looks at > pcxt->nworkers_launched during initialization, and appears to at least > trust it to indicate that more than zero actually-launched workers > will also show up when "nworkers_launched > 0". This trust seems critical > when parallel_leader_participation is off, because "node->nreaders == > 0" overrides the parallel_leader_participation GUC's setting (note > that node->nreaders comes directly from pcxt->nworkers_launched). If > zero workers show up, and parallel_leader_participation is off, but > pcxt->nworkers_launched/node->nreaders is non-zero, won't the Gather > never make forward progress? > Ideally, that situation should be detected and we should throw an error, but that doesn't happen today. However, it will be handled with Robert's patch on the other thread for CF entry [1]. [1] - https://commitfest.postgresql.org/16/1341/ -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Sun, Jan 21, 2018 at 6:34 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> Why is this okay for Gather nodes, though? nodeGather.c looks at >> pcxt->nworkers_launched during initialization, and appears to at least >> trust it to indicate that more than zero actually-launched workers >> will also show up when "nworkers_launched > 0". This trust seems critical >> when parallel_leader_participation is off, because "node->nreaders == >> 0" overrides the parallel_leader_participation GUC's setting (note >> that node->nreaders comes directly from pcxt->nworkers_launched). If >> zero workers show up, and parallel_leader_participation is off, but >> pcxt->nworkers_launched/node->nreaders is non-zero, won't the Gather >> never make forward progress? > > Ideally, that situation should be detected and we should throw an > error, but that doesn't happen today. However, it will be handled > with Robert's patch on the other thread for CF entry [1]. I knew that, but I was confused by your sketch of the WaitForParallelWorkerToAttach() API [1]. Specifically, your suggestion that the problem was unique to nbtsort.c, or was at least something that nbtsort.c had to take a special interest in. It now appears more like a general problem with a general solution, and likely one that won't need *any* changes to code in places like nodeGather.c (or nbtsort.c, in the case of my patch). I guess that you meant that parallel CREATE INDEX is the first thing to care about the *precise* number of nworkers_launched -- that is kind of a new thing. That doesn't seem like it makes any practical difference to us, though. I don't see why nbtsort.c should take a special interest in this problem, for example by calling WaitForParallelWorkerToAttach() itself. I may have missed something, but right now ISTM that it would be risky to make the API anything other than what both nodeGather.c and nbtsort.c already expect (that they'll either have nworkers_launched workers show up, or be able to propagate an error). [1] https://postgr.es/m/CAA4eK1KzvXTCFF8inhcEviUPxp4yWCS3rZuwjfqMttf75x2rvA@mail.gmail.com -- Peter Geoghegan
On Mon, Jan 22, 2018 at 10:36 AM, Peter Geoghegan <pg@bowt.ie> wrote: > On Sun, Jan 21, 2018 at 6:34 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >>> Why is this okay for Gather nodes, though? nodeGather.c looks at >>> pcxt->nworkers_launched during initialization, and appears to at least >>> trust it to indicate that more than zero actually-launched workers >>> will also show up when "nworkers_launched > 0". This trust seems critical >>> when parallel_leader_participation is off, because "node->nreaders == >>> 0" overrides the parallel_leader_participation GUC's setting (note >>> that node->nreaders comes directly from pcxt->nworkers_launched). If >>> zero workers show up, and parallel_leader_participation is off, but >>> pcxt->nworkers_launched/node->nreaders is non-zero, won't the Gather >>> never make forward progress? >> >> Ideally, that situation should be detected and we should throw an >> error, but that doesn't happen today. However, it will be handled >> with Robert's patch on the other thread for CF entry [1]. > > I knew that, but I was confused by your sketch of the > WaitForParallelWorkerToAttach() API [1]. Specifically, your suggestion > that the problem was unique to nbtsort.c, or was at least something > that nbtsort.c had to take a special interest in. It now appears more > like a general problem with a general solution, and likely one that > won't need *any* changes to code in places like nodeGather.c (or > nbtsort.c, in the case of my patch). > > I guess that you meant that parallel CREATE INDEX is the first thing > to care about the *precise* number of nworkers_launched -- that is > kind of a new thing. That doesn't seem like it makes any practical > difference to us, though. I don't see why nbtsort.c should take a > special interest in this problem, for example by calling > WaitForParallelWorkerToAttach() itself. I may have missed something, > but right now ISTM that it would be risky to make the API anything > other than what both nodeGather.c and nbtsort.c already expect (that > they'll either have nworkers_launched workers show up, or be able to > propagate an error). > The difference is that nodeGather.c doesn't have any logic like the one you have in _bt_leader_heapscan where the patch waits for each worker to increment nparticipantsdone. For Gather node, we do such a thing (wait for all workers to finish) by calling WaitForParallelWorkersToFinish which will have the capability after Robert's patch to detect if any worker is exited abnormally (fork failure or failed before attaching to the error queue). -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Mon, Jan 22, 2018 at 3:52 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > The difference is that nodeGather.c doesn't have any logic like the > one you have in _bt_leader_heapscan where the patch waits for each > worker to increment nparticipantsdone. For Gather node, we do such a > thing (wait for all workers to finish) by calling > WaitForParallelWorkersToFinish which will have the capability after > Robert's patch to detect if any worker is exited abnormally (fork > failure or failed before attaching to the error queue). FWIW, I don't think that that's really much of a difference. ExecParallelFinish() calls WaitForParallelWorkersToFinish(), which is similar to how _bt_end_parallel() calls WaitForParallelWorkersToFinish() in the patch. The _bt_leader_heapscan() condition variable wait for workers that you refer to is quite a bit like how gather_readnext() behaves. It generally checks to make sure that all tuple queues are done. gather_readnext() can wait for developments using WaitLatch(), to make sure every tuple queue is visited, with all output reliably consumed. This doesn't look all that similar to _bt_leader_heapscan(), I suppose, but I think that that's only because it's normal for all output to become available all at once for nbtsort.c workers. The startup cost is close to or actually the same as the total cost, as it *always* is for sort nodes. -- Peter Geoghegan
On Thu, Jan 18, 2018 at 9:22 AM, Robert Haas <robertmhaas@gmail.com> wrote: > But, hang on a minute. Why do the workers need to wait for the leader > anyway? Can't they just exit once they're done sorting? I think the > original reason why this ended up in the patch is that we needed the > leader to assume ownership of the tapes to avoid having the tapes get > blown away when the worker exits. But, IIUC, with sharedfileset.c, > that problem no longer exists. The tapes are jointly owned by all of > the cooperating backends and the last one to detach from it will > remove them. So, if the worker sorts, advertises that it's done in > shared memory, and exits, then nothing should get removed and the > leader can adopt the tapes whenever it gets around to it. BTW, I want to point out that using the shared fileset infrastructure is only a very small impediment to adding randomAccess support. If we really wanted to support randomAccess for the leader's tapeset, while recycling blocks from worker BufFiles, it looks like all we'd have to do is change PathNameOpenTemporaryFile() to open files O_RDWR, rather than O_RDONLY (shared fileset BufFiles that are opened after export always have O_RDONLY segments -- we'd also have to change some assertions, as well as some comments). Overall, this approach looks straightforward, and isn't something that I can find an issue with after an hour or so of manual testing. Now, I'm not actually suggesting we go that way. As you know, randomAccess isn't used by CREATE INDEX, and randomAccess may never be needed for any parallel sort operation. More importantly, Thomas made PathNameOpenTemporaryFile() use O_RDONLY for a reason, and I don't want to trade one special case (randomAccess disallowed for parallel tuplesort leader tapeset) in exchange for another one (the logtape.c calls to BufFileOpenShared() ask for read-write BufFiles, not read-only BufFiles). I'm pointing this out because this is something that should increase confidence in the changes I've proposed to logtape.c. The fact that randomAccess support *would* be straightforward is a sign that I haven't accidentally introduced some other assumption, or special case. -- Peter Geoghegan
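In code terms, the experiment described amounts to roughly a one-flag change (paraphrased from memory; the real fd.c code may differ in detail):

/* as things stand: segments of a shared BufFile opened after export
 * are read-only to other backends */
file = PathNameOpenFile(path, O_RDONLY | PG_BINARY);

/* the randomAccess experiment: open read-write instead, so that the
 * leader's unified tapeset can recycle blocks from worker tapes */
file = PathNameOpenFile(path, O_RDWR | PG_BINARY);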
On Tue, Jan 23, 2018 at 1:45 AM, Peter Geoghegan <pg@bowt.ie> wrote: > On Mon, Jan 22, 2018 at 3:52 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> The difference is that nodeGather.c doesn't have any logic like the >> one you have in _bt_leader_heapscan where the patch waits for each >> worker to increment nparticipantsdone. For Gather node, we do such a >> thing (wait for all workers to finish) by calling >> WaitForParallelWorkersToFinish which will have the capability after >> Robert's patch to detect if any worker is exited abnormally (fork >> failure or failed before attaching to the error queue). > > FWIW, I don't think that that's really much of a difference. > > ExecParallelFinish() calls WaitForParallelWorkersToFinish(), which is > similar to how _bt_end_parallel() calls > WaitForParallelWorkersToFinish() in the patch. The > _bt_leader_heapscan() condition variable wait for workers that you > refer to is quite a bit like how gather_readnext() behaves. It > generally checks to make sure that all tuple queues are done. > gather_readnext() can wait for developments using WaitLatch(), to make > sure every tuple queue is visited, with all output reliably consumed. > The difference lies in the fact that in gather_readnext, we use tuple queue mechanism which has the capability to detect that the workers are stopped/exited whereas _bt_leader_heapscan doesn't have any such capability, so I think it will loop forever. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Mon, Jan 22, 2018 at 6:45 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> FWIW, I don't think that that's really much of a difference. >> >> ExecParallelFinish() calls WaitForParallelWorkersToFinish(), which is >> similar to how _bt_end_parallel() calls >> WaitForParallelWorkersToFinish() in the patch. The >> _bt_leader_heapscan() condition variable wait for workers that you >> refer to is quite a bit like how gather_readnext() behaves. It >> generally checks to make sure that all tuple queues are done. >> gather_readnext() can wait for developments using WaitLatch(), to make >> sure every tuple queue is visited, with all output reliably consumed. >> > > The difference lies in the fact that in gather_readnext, we use tuple > queue mechanism which has the capability to detect that the workers > are stopped/exited whereas _bt_leader_heapscan doesn't have any such > capability, so I think it will loop forever. _bt_leader_heapscan() can detect when workers exit early, at least in the vast majority of cases. It can do this simply by processing interrupts and automatically propagating any error -- nothing special about that. It can also detect when workers have finished successfully, because of course, that's the main reason for its existence. What remains, exactly? I don't know that much about tuple queues, but from a quick read I guess you might be talking about shm_mq_receive() + shm_mq_wait_internal(). It's not obvious that that will work in all cases ("Note that if handle == NULL, and the process fails to attach, we'll potentially get stuck here forever"). Also, I don't see how this addresses the parallel_leader_participation issue I raised. -- Peter Geoghegan
On Tue, Jan 23, 2018 at 8:43 AM, Peter Geoghegan <pg@bowt.ie> wrote: > On Mon, Jan 22, 2018 at 6:45 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >>> FWIW, I don't think that that's really much of a difference. >>> >>> ExecParallelFinish() calls WaitForParallelWorkersToFinish(), which is >>> similar to how _bt_end_parallel() calls >>> WaitForParallelWorkersToFinish() in the patch. The >>> _bt_leader_heapscan() condition variable wait for workers that you >>> refer to is quite a bit like how gather_readnext() behaves. It >>> generally checks to make sure that all tuple queues are done. >>> gather_readnext() can wait for developments using WaitLatch(), to make >>> sure every tuple queue is visited, with all output reliably consumed. >>> >> >> The difference lies in the fact that in gather_readnext, we use tuple >> queue mechanism which has the capability to detect that the workers >> are stopped/exited whereas _bt_leader_heapscan doesn't have any such >> capability, so I think it will loop forever. > > _bt_leader_heapscan() can detect when workers exit early, at least in > the vast majority of cases. It can do this simply by processing > interrupts and automatically propagating any error -- nothing special > about that. It can also detect when workers have finished > successfully, because of course, that's the main reason for its > existence. What remains, exactly? > Will it be able to detect a fork failure, or if a worker exits before attaching to the error queue? I think you can try it by forcing a fork failure in do_start_bgworker and seeing the behavior of _bt_leader_heapscan. I could have tried it and let you know the results, but the latest patch doesn't seem to apply cleanly. > I don't know that much about tuple queues, but from a quick read I > guess you might be talking about shm_mq_receive() + > shm_mq_wait_internal(). It's not obvious that that will work in all > cases ("Note that if handle == NULL, and the process fails to attach, > we'll potentially get stuck here forever"). Also, I don't see how this > addresses the parallel_leader_participation issue I raised. > I am talking about shm_mq_receive->shm_mq_counterparty_gone. In shm_mq_counterparty_gone, it can detect if the worker is gone by using GetBackgroundWorkerPid. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
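For reference, the detection being pointed at looks roughly like this (a simplified paraphrase of the shm_mq.c logic, not a verbatim excerpt; "handle" is the worker's BackgroundWorkerHandle):

pid_t		pid;
BgwHandleStatus status;

status = GetBackgroundWorkerPid(handle, &pid);
if (status == BGWH_STOPPED || status == BGWH_POSTMASTER_DIED)
	counterparty_gone = true;	/* stop waiting; the sender can never attach */

The point of contention is that _bt_leader_heapscan's condition variable loop has no equivalent check today, so after a fork() failure there is nothing to wake it up or tell it to stop waiting.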
On Mon, Jan 22, 2018 at 10:13 PM, Peter Geoghegan <pg@bowt.ie> wrote: > _bt_leader_heapscan() can detect when workers exit early, at least in > the vast majority of cases. It can do this simply by processing > interrupts and automatically propagating any error -- nothing special > about that. It can also detect when workers have finished > successfully, because of course, that's the main reason for its > existence. What remains, exactly? As Amit says, what remains is the case where fork() fails or the worker dies before it reaches the line in ParallelWorkerMain that reads shm_mq_set_sender(mq, MyProc). In those cases, no error will be signaled until you call WaitForParallelWorkersToFinish(). If you wait prior to that point for a number of workers equal to nworkers_launched, you will wait forever in those cases. I am going to repeat my previous suggestion that we use a Barrier here. Given the discussion subsequent to my original proposal, this can be a lot simpler than what I suggested originally. Each worker does BarrierAttach() before beginning to read tuples (exiting if the phase returned is non-zero) and BarrierArriveAndDetach() when it's done sorting. The leader does BarrierAttach() before launching workers and BarrierArriveAndWait() when it's done sorting. If we don't do this, we're going to have to invent some other mechanism to count the participants that actually initialize successfully, but that seems like it's just duplicating code. This proposal has some minor advantages even when no fork() failure or similar occurs. If, for example, one or more workers take a long time to start, the leader doesn't have to wait for them before writing out the index. As soon as all the workers that attached to the Barrier have arrived at the end of phase 0, the leader can build a new tape set from all of the tapes that exist at that time. It does not need to wait for the remaining workers to start up and create empty tapes. This is only a minor advantage since we probably shouldn't be doing CREATE INDEX in parallel in the first place if the index build is so short that this scenario is likely to occur, but we get it basically for free, so why not? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Jan 23, 2018 at 10:36 AM, Robert Haas <robertmhaas@gmail.com> wrote: > As Amit says, what remains is the case where fork() fails or the > worker dies before it reaches the line in ParallelWorkerMain that > reads shm_mq_set_sender(mq, MyProc). In those cases, no error will be > signaled until you call WaitForParallelWorkersToFinish(). If you wait > prior to that point for a number of workers equal to > nworkers_launched, you will wait forever in those cases. Another option might be to actually call WaitForParallelWorkersToFinish() in place of a condition variable or barrier, as Amit suggested at one point. > I am going to repeat my previous suggest that we use a Barrier here. > Given the discussion subsequent to my original proposal, this can be a > lot simpler than what I suggested originally. Each worker does > BarrierAttach() before beginning to read tuples (exiting if the phase > returned is non-zero) and BarrierArriveAndDetach() when it's done > sorting. The leader does BarrierAttach() before launching workers and > BarrierArriveAndWait() when it's done sorting. If we don't do this, > we're going to have to invent some other mechanism to count the > participants that actually initialize successfully, but that seems > like it's just duplicating code. I think that this closes the door to leader non-participation as anything other than a developer-only debug option, which might be fine. If parallel_leader_participation=off (or some way of getting the same behavior through a #define) is to be retained, then an artificial wait is required as a substitute for the leader's participation as a worker. -- Peter Geoghegan
On Tue, Jan 23, 2018 at 10:50 AM, Peter Geoghegan <pg@bowt.ie> wrote: > On Tue, Jan 23, 2018 at 10:36 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> I am going to repeat my previous suggest that we use a Barrier here. >> Given the discussion subsequent to my original proposal, this can be a >> lot simpler than what I suggested originally. Each worker does >> BarrierAttach() before beginning to read tuples (exiting if the phase >> returned is non-zero) and BarrierArriveAndDetach() when it's done >> sorting. The leader does BarrierAttach() before launching workers and >> BarrierArriveAndWait() when it's done sorting. If we don't do this, >> we're going to have to invent some other mechanism to count the >> participants that actually initialize successfully, but that seems >> like it's just duplicating code. > > I think that this closes the door to leader non-participation as > anything other than a developer-only debug option, which might be > fine. If parallel_leader_participation=off (or some way of getting the > same behavior through a #define) is to be retained, then an artificial > wait is required as a substitute for the leader's participation as a > worker. This idea of an artificial wait seems pretty grotty to me. If we made it one second, would that be okay with Valgrind builds? And when it wasn't sufficient, wouldn't we be back to waiting forever? Finally, it's still not clear to me why nodeGather.c's use of parallel_leader_participation=off doesn't suffer from similar problems [1]. [1] https://postgr.es/m/CAH2-Wz=cAMX5btE1s=aTz7CLwzpEPm_NsUhAMAo5t5=1i9VcwQ@mail.gmail.com -- Peter Geoghegan
On Tue, Jan 23, 2018 at 2:11 PM, Peter Geoghegan <pg@bowt.ie> wrote: > Finally, it's still not clear to me why nodeGather.c's use of > parallel_leader_participation=off doesn't suffer from similar problems > [1]. Thomas and I just concluded that it does. See my email on the other thread just now. I thought that I had the failure cases all nailed down here now, but I guess not. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Jan 19, 2018 at 6:22 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> (3) >> erm, maybe it's a problem that errors occurring in workers while the >> leader is waiting at a barrier won't unblock the leader (we don't >> detach from barriers on abort/exit) -- I'll look into this. > > I think if there's an ERROR, the general parallelism machinery is > going to arrange to kill every worker, so nothing matters in that case > unless barrier waits ignore interrupts, which I'm pretty sure they > don't. (Also: if they do, I'll hit the ceiling; that would be awful.) (After talking this through with Robert off-list). Right, the CHECK_FOR_INTERRUPTS() in ConditionVariableSleep() handles errors from parallel workers. There is no problem here. -- Thomas Munro http://www.enterprisedb.com
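For reference, a sketch of the condition-variable wait loop shape in question; SortShared and the done-counter test are illustrative stand-ins, and the locking around the shared counters is elided. The point is that ConditionVariableSleep() runs CHECK_FOR_INTERRUPTS() internally, so an ERROR sent by a worker via PROCSIG_PARALLEL_MESSAGE is re-thrown in the waiting leader rather than slept through:

#include "postgres.h"
#include "storage/condition_variable.h"

typedef struct SortShared		/* hypothetical shared state */
{
	ConditionVariable workersdonecv;
	int			nworkers;
	int			nworkersdone;
} SortShared;

static void
wait_for_workers(SortShared *shared)
{
	ConditionVariablePrepareToSleep(&shared->workersdonecv);
	while (shared->nworkersdone < shared->nworkers)
		ConditionVariableSleep(&shared->workersdonecv, 0);	/* CFI() inside */
	ConditionVariableCancelSleep();
}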
On Wed, Jan 24, 2018 at 12:20 AM, Peter Geoghegan <pg@bowt.ie> wrote: > On Tue, Jan 23, 2018 at 10:36 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> As Amit says, what remains is the case where fork() fails or the >> worker dies before it reaches the line in ParallelWorkerMain that >> reads shm_mq_set_sender(mq, MyProc). In those cases, no error will be >> signaled until you call WaitForParallelWorkersToFinish(). If you wait >> prior to that point for a number of workers equal to >> nworkers_launched, you will wait forever in those cases. > > Another option might be to actually call > WaitForParallelWorkersToFinish() in place of a condition variable or > barrier, as Amit suggested at one point. > Yes, the only thing that is slightly worrying about using WaitForParallelWorkersToFinish is that the leader backend needs to wait for workers to finish rather than just finishing its sort-related work. I think there shouldn't be much difference between when the sort is done and when the workers actually finish the remaining resource cleanup. OTOH, if we are not okay with this solution and want to go with some use of barriers to solve this problem, then we can evaluate that as well, but I feel it is better if we can use the method that is used in other parallelism code to solve this problem (which is to use WaitForParallelWorkersToFinish). >> I am going to repeat my previous suggest that we use a Barrier here. >> Given the discussion subsequent to my original proposal, this can be a >> lot simpler than what I suggested originally. Each worker does >> BarrierAttach() before beginning to read tuples (exiting if the phase >> returned is non-zero) and BarrierArriveAndDetach() when it's done >> sorting. The leader does BarrierAttach() before launching workers and >> BarrierArriveAndWait() when it's done sorting. >> How does the leader detect if one of the workers does BarrierAttach and then fails (either exits or errors out) before doing BarrierArriveAndDetach? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Wed, Jan 24, 2018 at 5:59 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >>> I am going to repeat my previous suggest that we use a Barrier here. >>> Given the discussion subsequent to my original proposal, this can be a >>> lot simpler than what I suggested originally. Each worker does >>> BarrierAttach() before beginning to read tuples (exiting if the phase >>> returned is non-zero) and BarrierArriveAndDetach() when it's done >>> sorting. The leader does BarrierAttach() before launching workers and >>> BarrierArriveAndWait() when it's done sorting. > > How does leader detect if one of the workers does BarrierAttach and > then fails (either exits or error out) before doing > BarrierArriveAndDetach? If you attach and then exit cleanly, that's a programming error and would cause anyone who runs BarrierArriveAndWait() to hang forever. If you attach and raise an error, the leader will receive an error message via CFI() and will then raise an error itself and terminate all workers during cleanup. -- Thomas Munro http://www.enterprisedb.com
On Wed, Jan 24, 2018 at 10:36 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > On Wed, Jan 24, 2018 at 5:59 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >>>> I am going to repeat my previous suggest that we use a Barrier here. >>>> Given the discussion subsequent to my original proposal, this can be a >>>> lot simpler than what I suggested originally. Each worker does >>>> BarrierAttach() before beginning to read tuples (exiting if the phase >>>> returned is non-zero) and BarrierArriveAndDetach() when it's done >>>> sorting. The leader does BarrierAttach() before launching workers and >>>> BarrierArriveAndWait() when it's done sorting. >> >> How does leader detect if one of the workers does BarrierAttach and >> then fails (either exits or error out) before doing >> BarrierArriveAndDetach? > > If you attach and then exit cleanly, that's a programming error and > would cause anyone who runs BarrierArriveAndWait() to hang forever. > Right, but what if the worker dies due to something like proc_exit(1) before calling BarrierArriveAndWait? I think this is part of the problem we have solved in WaitForParallelWorkersToFinish: if the worker exits abruptly at any point for some reason, the system should not hang. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Wed, Jan 24, 2018 at 6:43 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Wed, Jan 24, 2018 at 10:36 AM, Thomas Munro > <thomas.munro@enterprisedb.com> wrote: >> On Wed, Jan 24, 2018 at 5:59 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >>>>> I am going to repeat my previous suggest that we use a Barrier here. >>>>> Given the discussion subsequent to my original proposal, this can be a >>>>> lot simpler than what I suggested originally. Each worker does >>>>> BarrierAttach() before beginning to read tuples (exiting if the phase >>>>> returned is non-zero) and BarrierArriveAndDetach() when it's done >>>>> sorting. The leader does BarrierAttach() before launching workers and >>>>> BarrierArriveAndWait() when it's done sorting. >>> >>> How does leader detect if one of the workers does BarrierAttach and >>> then fails (either exits or error out) before doing >>> BarrierArriveAndDetach? >> >> If you attach and then exit cleanly, that's a programming error and >> would cause anyone who runs BarrierArriveAndWait() to hang forever. >> > > Right, but what if the worker dies due to something proc_exit(1) or > something like that before calling BarrierArriveAndWait. I think this > is part of the problem we have solved in > WaitForParallelWorkersToFinish such that if the worker exits abruptly > at any point due to some reason, the system should not hang. Actually what I said before is no longer true: after commit 2badb5af, if you exit unexpectedly then the new ParallelWorkerShutdown() exit hook delivers PROCSIG_PARALLEL_MESSAGE (apparently after detaching from the error queue) and the leader aborts when it tries to read the error queue. I just hacked Parallel Hash like this: BarrierAttach(build_barrier); + if (ParallelWorkerNumber == 0) + { + pg_usleep(1000000); + proc_exit(1); + } Now I see: postgres=# select count(*) from foox r join foox s on r.a = s.a; ERROR: lost connection to parallel worker Using a debugger I can see the leader raising that error with this stack: HandleParallelMessages at parallel.c:890 ProcessInterrupts at postgres.c:3053 ConditionVariableSleep(cv=0x000000010a62e4c8, wait_event_info=134217737) at condition_variable.c:151 BarrierArriveAndWait(barrier=0x000000010a62e4b0, wait_event_info=134217737) at barrier.c:191 MultiExecParallelHash(node=0x00007ffcd9050b10) at nodeHash.c:312 MultiExecHash(node=0x00007ffcd9050b10) at nodeHash.c:112 MultiExecProcNode(node=0x00007ffcd9050b10) at execProcnode.c:502 ExecParallelHashJoin [inlined] ExecHashJoinImpl(pstate=0x00007ffcda01baa0, parallel='\x01') at nodeHashjoin.c:291 ExecParallelHashJoin(pstate=0x00007ffcda01baa0) at nodeHashjoin.c:582 -- Thomas Munro http://www.enterprisedb.com
On Tue, Jan 23, 2018 at 9:43 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > Right, but what if the worker dies due to something proc_exit(1) or > something like that before calling BarrierArriveAndWait. I think this > is part of the problem we have solved in > WaitForParallelWorkersToFinish such that if the worker exits abruptly > at any point due to some reason, the system should not hang. I have used Thomas' chaos-monkey-fork-process.patch to verify: 1. The problem of fork failure causing nbtsort.c to wait forever is a real problem. Sure enough, the coding pattern within _bt_leader_heapscan() can cause us to wait forever even with commit 2badb5afb89cd569500ef7c3b23c7a9d11718f2f, more or less as a consequence of the patch not using tuple queues (it uses the new tuplesort sharing thing instead). 2. Simply adding a single call to WaitForParallelWorkersToFinish() within _bt_leader_heapscan() before waiting on our condition variable fixes the problem -- errors are reliably propagated, and we never end up waiting forever. 3. This short term fix works just as well with parallel_leader_participation=off. At this point, my preferred solution is for someone to go implement Amit's WaitForParallelWorkersToAttach() idea [1] (Amit himself seems like the logical person for the job). Once that's committed, I can post a new version of the patch that uses that new infrastructure -- I'll add a call to the new function, without changing anything else. Failing that, we could actually just use WaitForParallelWorkersToFinish(). I still don't want to use a barrier, mostly because it complicates parallel_leader_participation=off, something that Amit is in agreement with [2][3]. For now, I am waiting for feedback from Robert on next steps. [1] https://postgr.es/m/CAH2-Wzm6dF=g9LYwthgCqzRc4DzBE-8Tv28Yvg0XJ8Q6e4+cBQ@mail.gmail.com [2] https://postgr.es/m/CAA4eK1LEFd28p1kw2Fst9LzgBgfMbDEq9wPh9jWFC0ye6ce62A%40mail.gmail.com [3] https://postgr.es/m/CAA4eK1+a0OF4M231vBgPr_0Ygg_BNmRGZLiB7WQDE-FYBSyrGg@mail.gmail.com -- Peter Geoghegan
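A sketch of the short-term fix from point 2 above, reusing the illustrative SortShared/wait-loop shape from earlier -- this is not the actual _bt_leader_heapscan() coding:

#include "postgres.h"
#include "access/parallel.h"
#include "storage/condition_variable.h"

static void
leader_wait(ParallelContext *pcxt, SortShared *shared)
{
	/*
	 * Blocks until every launched worker has exited, raising an ERROR if a
	 * worker died -- including the case where fork() failed and the worker
	 * never started at all.
	 */
	WaitForParallelWorkersToFinish(pcxt);

	/* By now the condition variable wait is a formality; it cannot hang */
	ConditionVariablePrepareToSleep(&shared->workersdonecv);
	while (shared->nworkersdone < shared->nworkers)
		ConditionVariableSleep(&shared->workersdonecv, 0);
	ConditionVariableCancelSleep();
}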
On Thu, Jan 25, 2018 at 8:54 AM, Peter Geoghegan <pg@bowt.ie> wrote: > I have used Thomas' chaos-monkey-fork-process.patch to verify: > > 1. The problem of fork failure causing nbtsort.c to wait forever is a > real problem. Sure enough, the coding pattern within > _bt_leader_heapscan() can cause us to wait forever even with commit > 2badb5afb89cd569500ef7c3b23c7a9d11718f2f, more or less as a > consequence of the patch not using tuple queues (it uses the new > tuplesort sharing thing instead). Just curious: does the attached also help? > 2. Simply adding a single call to WaitForParallelWorkersToFinish() > within _bt_leader_heapscan() before waiting on our condition variable > fixes the problem -- errors are reliably propagated, and we never end > up waiting forever. That does seem like a nice, simple solution and I am not against it. The niggling thing that bothers me about it, though, is that it requires the client of parallel.c to follow a slightly complicated protocol or risk a rare obscure failure mode, and recognise the cases where that's necessary. Specifically, if you're not blocking in a shm_mq wait loop, then you must make a call to this new interface before you do any other kind of latch wait, but if you get that wrong you'll probably not notice since fork failure is rare! It seems like it'd be nicer if we could figure out a way to make it so that any latch/CFI loop would automatically be safe against fork failure. The attached (if it actually works, I dunno) is the worst way, but I wonder if there is some way to traffic just a teensy bit more information from postmaster to leader so that it could be efficient... -- Thomas Munro http://www.enterprisedb.com
Attachment
On Wed, Jan 24, 2018 at 12:13 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > On Thu, Jan 25, 2018 at 8:54 AM, Peter Geoghegan <pg@bowt.ie> wrote: >> I have used Thomas' chaos-monkey-fork-process.patch to verify: >> >> 1. The problem of fork failure causing nbtsort.c to wait forever is a >> real problem. Sure enough, the coding pattern within >> _bt_leader_heapscan() can cause us to wait forever even with commit >> 2badb5afb89cd569500ef7c3b23c7a9d11718f2f, more or less as a >> consequence of the patch not using tuple queues (it uses the new >> tuplesort sharing thing instead). > > Just curious: does the attached also help? I can still reproduce the problem without the fix I described (which does work), using your patch instead. Offhand, I suspect that the way you set ParallelMessagePending may not always leave it set when it should be. >> 2. Simply adding a single call to WaitForParallelWorkersToFinish() >> within _bt_leader_heapscan() before waiting on our condition variable >> fixes the problem -- errors are reliably propagated, and we never end >> up waiting forever. > > That does seem like a nice, simple solution and I am not against it. > The niggling thing that bothers me about it, though, is that it > requires the client of parallel.c to follow a slightly complicated > protocol or risk a rare obscure failure mode, and recognise the cases > where that's necessary. Specifically, if you're not blocking in a > shm_mq wait loop, then you must make a call to this new interface > before you do any other kind of latch wait, but if you get that wrong > you'll probably not notice since fork failure is rare! It seems like > it'd be nicer if we could figure out a way to make it so that any > latch/CFI loop would automatically be safe against fork failure. It would certainly be nicer, but I don't see much risk if we add a comment next to nworkers_launched that said: Don't trust this until you've called (Amit's proposed) WaitForParallelWorkersToAttach() function, unless you're using the tuple queue infrastructure, which lets you not need to directly care about the distinction between a launched worker never starting, and a launched worker successfully completing. While I agree with what Robert said on the other thread -- "I guess that works, but it seems more like blind luck than good design. Parallel CREATE INDEX fails to be as "lucky" as Gather" -- that doesn't mean that that situation cannot be formalized. And even if it isn't formalized, then I think that that will probably be because Gather ends up doing almost the same thing. -- Peter Geoghegan
On Thu, Jan 25, 2018 at 9:28 AM, Peter Geoghegan <pg@bowt.ie> wrote: > On Wed, Jan 24, 2018 at 12:13 PM, Thomas Munro > <thomas.munro@enterprisedb.com> wrote: >> On Thu, Jan 25, 2018 at 8:54 AM, Peter Geoghegan <pg@bowt.ie> wrote: >>> I have used Thomas' chaos-monkey-fork-process.patch to verify: >>> >>> 1. The problem of fork failure causing nbtsort.c to wait forever is a >>> real problem. Sure enough, the coding pattern within >>> _bt_leader_heapscan() can cause us to wait forever even with commit >>> 2badb5afb89cd569500ef7c3b23c7a9d11718f2f, more or less as a >>> consequence of the patch not using tuple queues (it uses the new >>> tuplesort sharing thing instead). >> >> Just curious: does the attached also help? > > I can still reproduce the problem without the fix I described (which > does work), using your patch instead. > > Offhand, I suspect that the way you set ParallelMessagePending may not > always leave it set when it should be. Here's a version that works, and a minimal repro test module thing. Without 0003 applied, it hangs. With 0003 applied, it does this: postgres=# call test_fork_failure(); CALL postgres=# call test_fork_failure(); CALL postgres=# call test_fork_failure(); ERROR: lost connection to parallel worker postgres=# call test_fork_failure(); ERROR: lost connection to parallel worker I won't be surprised if 0003 is judged to be a horrendous abuse of the interrupt system, but these patches might at least be useful for understanding the problem. -- Thomas Munro http://www.enterprisedb.com
Attachment
On Wed, Jan 24, 2018 at 5:31 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > Here's a version that works, and a minimal repro test module thing. > Without 0003 applied, it hangs. I can confirm that this version does in fact fix the problem with parallel CREATE INDEX hanging in the event of (simulated) worker fork() failure. And, it seems to have at least one tiny advantage over the other approaches I was talking about that you didn't mention, which is that we never have to wait until the leader stops participating as a worker before an error is raised. IOW, either the whole parallel CREATE INDEX operation throws an error at an early point in the CREATE INDEX, or the CREATE INDEX completely succeeds. Obviously, the other, stated advantage is more relevant: *everyone* automatically doesn't have to worry about nworkers_launched being inaccurate this way, including code that gets away with this today only due to using a tuple queue, such as nodeGather.c, but may not always get away with it in the future. I've run out of time to assess what you've done here in any real depth. For now, I will say that this approach seems interesting to me. I'll take a closer look tomorrow. -- Peter Geoghegan
On Thu, Jan 25, 2018 at 1:24 AM, Peter Geoghegan <pg@bowt.ie> wrote: > On Tue, Jan 23, 2018 at 9:43 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> Right, but what if the worker dies due to something proc_exit(1) or >> something like that before calling BarrierArriveAndWait. I think this >> is part of the problem we have solved in >> WaitForParallelWorkersToFinish such that if the worker exits abruptly >> at any point due to some reason, the system should not hang. > > I have used Thomas' chaos-monkey-fork-process.patch to verify: > > 1. The problem of fork failure causing nbtsort.c to wait forever is a > real problem. Sure enough, the coding pattern within > _bt_leader_heapscan() can cause us to wait forever even with commit > 2badb5afb89cd569500ef7c3b23c7a9d11718f2f, more or less as a > consequence of the patch not using tuple queues (it uses the new > tuplesort sharing thing instead). > > 2. Simply adding a single call to WaitForParallelWorkersToFinish() > within _bt_leader_heapscan() before waiting on our condition variable > fixes the problem -- errors are reliably propagated, and we never end > up waiting forever. > > 3. This short term fix works just as well with > parallel_leader_participation=off. > > At this point, my preferred solution is for someone to go implement > Amit's WaitForParallelWorkersToAttach() idea [1] (Amit himself seems > like the logical person for the job). > I can implement it and share a prototype patch with you which you can use to test parallel sort stuff. I would like to highlight the difference which you will see with WaitForParallelWorkersToAttach as compare to WaitForParallelWorkersToFinish() is that the former will give you how many of nworkers_launched workers are actually launched whereas latter gives an error if any of the expected workers is not launched. I feel former is good and your proposed way of calling it after the leader is done with its work has alleviated the minor disadvantage of this API which is that we need for workers to startup. However, now I see that you and Thomas are trying to find a different way to overcome this problem differently, so not sure if I should go ahead or not. I have seen that you told you wanted to look at Thomas's proposed stuff carefully tomorrow, so I will wait for you guys to decide which way is appropriate. > Once that's committed, I can > post a new version of the patch that uses that new infrastructure -- > I'll add a call to the new function, without changing anything else. > Failing that, we could actually just use > WaitForParallelWorkersToFinish(). I still don't want to use a barrier, > mostly because it complicates parallel_leader_participation=off, > something that Amit is in agreement with [2][3]. > I think if we want we can use barrier API's to solve this problem, but I kind of have a feeling that it doesn't seem to be the most appropriate API, especially because existing API like WaitForParallelWorkersToFinish() can serve the need in a similar way. Just to conclude, following are proposed ways to solve this problem: 1. Implement a new API WaitForParallelWorkersToAttach and use that to solve this problem. Peter G. and Amit thinks, this is a good way to solve this problem. 2. Use existing API WaitForParallelWorkersToFinish to solve this problem. Peter G. feels that if API mentioned in #1 is not available, we can use this to solve the problem and I agree with that position. Thomas is not against it. 3. Use Thomas's new way to detect such failures. 
It is not clear to me at this stage if any one of us has accepted it as the way to proceed, but Thomas and Peter G. want to investigate it further. 4. Use of Barrier API to solve this problem. Robert appears to be strongly in favor of this approach. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Fri, Jan 26, 2018 at 11:30 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Thu, Jan 25, 2018 at 1:24 AM, Peter Geoghegan <pg@bowt.ie> wrote: >> On Tue, Jan 23, 2018 at 9:43 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >>> Right, but what if the worker dies due to something proc_exit(1) or >>> something like that before calling BarrierArriveAndWait. I think this >>> is part of the problem we have solved in >>> WaitForParallelWorkersToFinish such that if the worker exits abruptly >>> at any point due to some reason, the system should not hang. >> >> I have used Thomas' chaos-monkey-fork-process.patch to verify: >> >> 1. The problem of fork failure causing nbtsort.c to wait forever is a >> real problem. Sure enough, the coding pattern within >> _bt_leader_heapscan() can cause us to wait forever even with commit >> 2badb5afb89cd569500ef7c3b23c7a9d11718f2f, more or less as a >> consequence of the patch not using tuple queues (it uses the new >> tuplesort sharing thing instead). >> >> 2. Simply adding a single call to WaitForParallelWorkersToFinish() >> within _bt_leader_heapscan() before waiting on our condition variable >> fixes the problem -- errors are reliably propagated, and we never end >> up waiting forever. >> >> 3. This short term fix works just as well with >> parallel_leader_participation=off. >> >> At this point, my preferred solution is for someone to go implement >> Amit's WaitForParallelWorkersToAttach() idea [1] (Amit himself seems >> like the logical person for the job). >> > > I can implement it and share a prototype patch with you which you can > use to test parallel sort stuff. I would like to highlight the > difference which you will see with WaitForParallelWorkersToAttach as > compare to WaitForParallelWorkersToFinish() is that the former will > give you how many of nworkers_launched workers are actually launched > whereas latter gives an error if any of the expected workers is not > launched. I feel former is good and your proposed way of calling it > after the leader is done with its work has alleviated the minor > disadvantage of this API which is that we need for workers to startup. > /we need for workers to startup./we need to wait for workers to startup. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Thu, Jan 25, 2018 at 10:00 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> At this point, my preferred solution is for someone to go implement >> Amit's WaitForParallelWorkersToAttach() idea [1] (Amit himself seems >> like the logical person for the job). >> > > I can implement it and share a prototype patch with you which you can > use to test parallel sort stuff. That would be great. Thank you. > I would like to highlight the > difference which you will see with WaitForParallelWorkersToAttach as > compare to WaitForParallelWorkersToFinish() is that the former will > give you how many of nworkers_launched workers are actually launched > whereas latter gives an error if any of the expected workers is not > launched. I feel former is good and your proposed way of calling it > after the leader is done with its work has alleviated the minor > disadvantage of this API which is that we need for workers to startup. I'm not sure that it makes much difference, though, since in the end WaitForParallelWorkersToFinish() is called anyway, much like nodeGather.c. Have I missed something? I had imagined that WaitForParallelWorkersToAttach() would give me an error in the style of WaitForParallelWorkersToFinish(), without actually waiting for the parallel workers to finish. > However, now I see that you and Thomas are trying to find a different > way to overcome this problem differently, so not sure if I should go > ahead or not. I have seen that you told you wanted to look at > Thomas's proposed stuff carefully tomorrow, so I will wait for you > guys to decide which way is appropriate. I suspect that the overhead of Thomas' experimental approach is going to cause problems in certain cases. Cases that are hard to foresee. That patch makes HandleParallelMessages() set ParallelMessagePending artificially, pending confirmation of having launched all workers. It was an interesting experiment, but I think that your WaitForParallelWorkersToAttach() idea has a better chance of working out. >> Once that's committed, I can >> post a new version of the patch that uses that new infrastructure -- >> I'll add a call to the new function, without changing anything else. >> Failing that, we could actually just use >> WaitForParallelWorkersToFinish(). I still don't want to use a barrier, >> mostly because it complicates parallel_leader_participation=off, >> something that Amit is in agreement with [2][3]. >> > > I think if we want we can use barrier API's to solve this problem, but > I kind of have a feeling that it doesn't seem to be the most > appropriate API, especially because existing API like > WaitForParallelWorkersToFinish() can serve the need in a similar way. I can't see a way in which using a barrier can have less complexity. I think it will have quite a bit more, and I suspect that you share this feeling. > Just to conclude, following are proposed ways to solve this problem: > > 1. Implement a new API WaitForParallelWorkersToAttach and use that to > solve this problem. Peter G. and Amit thinks, this is a good way to > solve this problem. > 2. Use existing API WaitForParallelWorkersToFinish to solve this > problem. Peter G. feels that if API mentioned in #1 is not available, > we can use this to solve the problem and I agree with that position. > Thomas is not against it. > 3. Use Thomas's new way to detect such failures. It is not clear to > me at this stage if any one of us have accepted it to be the way to > proceed, but Thomas and Peter G. want to investigate it further. > 4.
Use of Barrier API to solve this problem. Robert appears to be > strongly in favor of this approach. That's a good summary. The next revision of the patch will make the leader-participates-as-worker spool/Tuplesortstate start and finish sorting before the main leader spool/Tuplesortstate is even started. I did this with the intention of making it very clear that my approach does not assume a number of participants up-front -- that is actually something we only need a final answer on at the point that the leader merges, which is logically the last possible moment. Hopefully this will reassure Robert. It is quite a small change, but leads to a slightly cleaner organization within nbtsort.c, since _bt_begin_parallel() is the only point that has to deal with leader participation. Another minor advantage is that this makes the trace_sort overheads/duration for each of the two tuplesorts within the leader non-overlapping (when the leader participates as a worker). -- Peter Geoghegan
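For clarity, a sketch of the leader-side flow under the WaitForParallelWorkersToAttach() semantics Peter describes (error out in the style of WaitForParallelWorkersToFinish(), without waiting for workers to finish). Everything other than the parallel.h calls and pcxt->nworkers_launched is an illustrative stand-in, reusing names from the earlier sketches:

static void
build_index_in_parallel(ParallelContext *pcxt, SortShared *shared)
{
	LaunchParallelWorkers(pcxt);

	do_leader_sort();			/* leader's own sort happens first */

	/*
	 * ERRORs out if any launched worker failed to attach (e.g. fork()
	 * failure), but does not wait for workers to finish.  Called after
	 * the leader's own sort, it should rarely need to block at all.
	 */
	WaitForParallelWorkersToAttach(pcxt);

	/* pcxt->nworkers_launched is now safe to wait on */
	wait_for_workers(shared);
}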
On Fri, Jan 26, 2018 at 12:00 PM, Peter Geoghegan <pg@bowt.ie> wrote: > On Thu, Jan 25, 2018 at 10:00 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >>> At this point, my preferred solution is for someone to go implement >>> Amit's WaitForParallelWorkersToAttach() idea [1] (Amit himself seems >>> like the logical person for the job). >>> >> >> I can implement it and share a prototype patch with you which you can >> use to test parallel sort stuff. > > That would be great. Thank you. > >> I would like to highlight the >> difference which you will see with WaitForParallelWorkersToAttach as >> compare to WaitForParallelWorkersToFinish() is that the former will >> give you how many of nworkers_launched workers are actually launched >> whereas latter gives an error if any of the expected workers is not >> launched. I feel former is good and your proposed way of calling it >> after the leader is done with its work has alleviated the minor >> disadvantage of this API which is that we need for workers to startup. > > I'm not sure that it makes much difference, though, since in the end > WaitForParallelWorkersToFinish() is called anyway, much like > nodeGather.c. Have I missed something? > Nopes, you are right. I had in my mind that if we have something like what I am proposing, then we don't even need to detect failures in WaitForParallelWorkersToFinish and we can finish the work without failing. > I had imagined that WaitForParallelWorkersToAttach() would give me an > error in the style of WaitForParallelWorkersToFinish(), without > actually waiting for the parallel workers to finish. > I think that is also doable. I will give it a try and report back if I see any problem with it. However, it might take me some time as I am busy with few other things and I am planning to take two days off for some personal reasons, OTOH if it turns out to be a simple (which I expect it should be), then I will report back early. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Fri, Jan 26, 2018 at 1:30 AM, Peter Geoghegan <pg@bowt.ie> wrote: > I had imagined that WaitForParallelWorkersToAttach() would give me an > error in the style of WaitForParallelWorkersToFinish(), without > actually waiting for the parallel workers to finish. +1. If we're going to go that route, and that seems to be the consensus, then I think an error is more appropriate than returning an updated worker count. On the question of whether this is better or worse than using barriers, I'm not entirely sure. I understand that various objections to the Barrier concept have been raised, but I'm not personally convinced by any of them. On the other hand, if we only have to call WaitForParallelWorkersToAttach after the leader finishes its own sort, then there's no latency advantage to the barrier approach. I suspect we might still end up reworking this if we add the ability for new workers to join an index build in medias res at some point in the future -- but, as Peter points out, maybe the whole algorithm would get reworked in that scenario. So, since other people like WaitForParallelWorkersToAttach, I think we can just go with that for now. I don't want to kill this patch with unnecessary nitpicking. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Jan 26, 2018 at 10:01 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Jan 26, 2018 at 1:30 AM, Peter Geoghegan <pg@bowt.ie> wrote: >> I had imagined that WaitForParallelWorkersToAttach() would give me an >> error in the style of WaitForParallelWorkersToFinish(), without >> actually waiting for the parallel workers to finish. > > +1. If we're going to go that route, and that seems to be the > consensus, then I think an error is more appropriate than returning an > updated worker count. Great. Should I wait for Amit's WaitForParallelWorkersToAttach() patch to be posted, reviewed, and committed, or would you like to see what I came up with ("The next revision of the patch will make the leader-participates-as-worker spool/Tuplelsortstate start and finish sorting before the main leader spool/Tuplelsortstate is even started") today? -- Peter Geoghegan
On Fri, Jan 26, 2018 at 1:17 PM, Peter Geoghegan <pg@bowt.ie> wrote: > On Fri, Jan 26, 2018 at 10:01 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Fri, Jan 26, 2018 at 1:30 AM, Peter Geoghegan <pg@bowt.ie> wrote: >>> I had imagined that WaitForParallelWorkersToAttach() would give me an >>> error in the style of WaitForParallelWorkersToFinish(), without >>> actually waiting for the parallel workers to finish. >> >> +1. If we're going to go that route, and that seems to be the >> consensus, then I think an error is more appropriate than returning an >> updated worker count. > > Great. > > Should I wait for Amit's WaitForParallelWorkersToAttach() patch to be > posted, reviewed, and committed, or would you like to see what I came > up with ("The next revision of the patch will make the > leader-participates-as-worker spool/Tuplelsortstate start and finish > sorting before the main leader spool/Tuplelsortstate is even started") > today? I'm busy with other things, so no rush. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Jan 26, 2018 at 10:33 AM, Robert Haas <robertmhaas@gmail.com> wrote: > I'm busy with other things, so no rush. Got it. There is one question that I should probably get clarity on ahead of the next revision, which is: Should I rip out the code that disallows a "degenerate parallel CREATE INDEX" when parallel_leader_participation=off, or should I instead rip out any code that deals with parallel_leader_participation, and always have the leader participate as a worker? If I did the latter, then leader non-participation would live on as a #define debug option within nbtsort.c. It definitely seems like we'd want to preserve that at a minimum. -- Peter Geoghegan
On Fri, Jan 26, 2018 at 2:04 PM, Peter Geoghegan <pg@bowt.ie> wrote: > On Fri, Jan 26, 2018 at 10:33 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> I'm busy with other things, so no rush. > > Got it. > > There is one question that I should probably get clarity on ahead of > the next revision, which is: Should I rip out the code that disallows > a "degenerate parallel CREATE INDEX" when > parallel_leader_participation=off, or should I instead rip out any > code that deals with parallel_leader_participation, and always have > the leader participate as a worker? > > If I did the latter, then leader non-participation would live on as a > #define debug option within nbtsort.c. It definitely seems like we'd > want to preserve that at a minimum. Hmm, I like the idea of making it a #define instead of having it depend on parallel_leader_participation. Let's do that. If the consensus is later that it was the wrong decision, it'll be easy to change it back. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Jan 26, 2018 at 11:17 AM, Robert Haas <robertmhaas@gmail.com> wrote: > Hmm, I like the idea of making it a #define instead of having it > depend on parallel_leader_participation. Let's do that. If the > consensus is later that it was the wrong decision, it'll be easy to > change it back. WFM. -- Peter Geoghegan
On Fri, Jan 26, 2018 at 7:30 PM, Peter Geoghegan <pg@bowt.ie> wrote: > On Thu, Jan 25, 2018 at 10:00 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> However, now I see that you and Thomas are trying to find a different >> way to overcome this problem differently, so not sure if I should go >> ahead or not. I have seen that you told you wanted to look at >> Thomas's proposed stuff carefully tomorrow, so I will wait for you >> guys to decide which way is appropriate. > > I suspect that the overhead of Thomas' experimental approach is going > to causes problems in certain cases. Cases that are hard to foresee. > That patch makes HandleParallelMessages() set ParallelMessagePending > artificially, pending confirmation of having launched all workers. > > It was an interesting experiment, but I think that your > WaitForParallelWorkersToAttach() idea has a better chance of working > out. Thanks for looking into this. Yeah. I think you're right that it could add a bit of overhead in some cases (ie if you receive a lot of signals that AREN'T caused by fork failure, then you'll enter HandleParallelMessage() every time unnecessarily), and it does feel a bit kludgy. The best idea I have to fix that so far is like this: (1) add a member fork_failure_count to struct BackgroundWorkerArray, (2) in do_start_bgworker() whenever fork fails, do ++BackgroundWorkerData->fork_failure_count (ie before a signal is sent to the leader), (3) in procsignal_sigusr1_handler where we normally do a bunch of CheckProcSignal(PROCSIG_XXX) stuff, if (BackgroundWorkerData->fork_failure_count != last_observed_fork_failure_count) HandleParallelMessageInterrupt(). As far as I know, as long as fork_failure_count is (say) int32 (ie not prone to tearing) then no locking is required due to the barriers implicit in the syscalls involved there. This is still slightly more pessimistic than it needs to be (the failed fork may be for someone else's ParallelContext), but only in rare cases so it would be practically as good as precise PROCSIG delivery. It's just that we aren't allowed to deliver PROCSIGs from the postmaster. We are allowed to communicate through BackgroundWorkerData, and there is a precedent for cluster-visible event counters in there already. I think you should proceed with Amit's plan. If we ever make a plan like the above work in future, it'd render that redundant by turning every CFI() into a cancellation point for fork failure, but I'm not planning to investigate further given the muted response to my scheming in this area so far. -- Thomas Munro http://www.enterprisedb.com
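Spelled out as code, the three steps sketched above might look as follows; fork_failure_count is a hypothetical new member, and the surrounding code is abbreviated:

/* (1) In struct BackgroundWorkerArray (bgworker.c): a counter the
 * postmaster bumps and any backend can read; int32-width loads and
 * stores aren't prone to tearing, so no locking is needed. */
uint32		fork_failure_count;

/* (2) In do_start_bgworker(), when fork() fails, before the leader
 * gets signalled: */
++BackgroundWorkerData->fork_failure_count;

/* (3) In procsignal_sigusr1_handler(), alongside the CheckProcSignal()
 * tests: */
static uint32 last_observed_fork_failure_count = 0;

if (BackgroundWorkerData->fork_failure_count !=
	last_observed_fork_failure_count)
{
	last_observed_fork_failure_count =
		BackgroundWorkerData->fork_failure_count;
	HandleParallelMessageInterrupt();	/* next CFI() checks for messages */
}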
On Fri, Jan 26, 2018 at 6:40 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > Thanks for looking into this. Yeah. I think you're right that it > could add a bit of overhead in some cases (ie if you receive a lot of > signals that AREN'T caused by fork failure, then you'll enter > HandleParallelMessage() every time unnecessarily), and it does feel a > bit kludgy. The best idea I have to fix that so far is like this: (1) > add a member fork_failure_count to struct BackgroundWorkerArray, (2) > in do_start_bgworker() whenever fork fails, do > ++BackgroundWorkerData->fork_failure_count (ie before a signal is sent > to the leader), (3) in procsignal_sigusr1_handler where we normally do > a bunch of CheckProcSignal(PROCSIG_XXX) stuff, if > (BackgroundWorkerData->fork_failure_count != > last_observed_fork_failure_count) HandleParallelMessageInterrupt(). > As far as I know, as long as fork_failure_count is (say) int32 (ie not > prone to tearing) then no locking is required due to the barriers > implicit in the syscalls involved there. This is still slightly more > pessimistic than it needs to be (the failed fork may be for someone > else's ParallelContext), but only in rare cases so it would be > practically as good as precise PROCSIG delivery. It's just that we > aren't allowed to deliver PROCSIGs from the postmaster. We are > allowed to communicate through BackgroundWorkerData, and there is a > precedent for cluster-visible event counters in there already. I could sign on to that plan, but I don't think we should hold this patch up for it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Jan 26, 2018 at 12:36 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Fri, Jan 26, 2018 at 12:00 PM, Peter Geoghegan <pg@bowt.ie> wrote: >> On Thu, Jan 25, 2018 at 10:00 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >>>> At this point, my preferred solution is for someone to go implement >>>> Amit's WaitForParallelWorkersToAttach() idea [1] (Amit himself seems >>>> like the logical person for the job). >>>> >>> >>> I can implement it and share a prototype patch with you which you can >>> use to test parallel sort stuff. >> >> That would be great. Thank you. >> >>> I would like to highlight the >>> difference which you will see with WaitForParallelWorkersToAttach as >>> compare to WaitForParallelWorkersToFinish() is that the former will >>> give you how many of nworkers_launched workers are actually launched >>> whereas latter gives an error if any of the expected workers is not >>> launched. I feel former is good and your proposed way of calling it >>> after the leader is done with its work has alleviated the minor >>> disadvantage of this API which is that we need for workers to startup. >> >> I'm not sure that it makes much difference, though, since in the end >> WaitForParallelWorkersToFinish() is called anyway, much like >> nodeGather.c. Have I missed something? >> > > Nopes, you are right. I had in my mind that if we have something like > what I am proposing, then we don't even need to detect failures in > WaitForParallelWorkersToFinish and we can finish the work without > failing. > >> I had imagined that WaitForParallelWorkersToAttach() would give me an >> error in the style of WaitForParallelWorkersToFinish(), without >> actually waiting for the parallel workers to finish. >> > > I think that is also doable. I will give it a try and report back if > I see any problem with it. > I have posted the patch for the above API and posted it on a new thread [1]. Do let me know either here or on that thread if the patch suffices your need? [1] - https://www.postgresql.org/message-id/CAA4eK1%2Be2MzyouF5bg%3DOtyhDSX%2B%3DAo%3D3htN%3DT-r_6s3gCtKFiw%40mail.gmail.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Sat, Jan 27, 2018 at 12:20 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > I have posted the patch for the above API and posted it on a new > thread [1]. Do let me know either here or on that thread if the patch > suffices your need? I've responded to you over on that thread. Thanks again for helping me. I already have a revision of my patch lined up that is coded to target your new WaitForParallelWorkersToAttach() interface, plus some other changes. These include: * Make the leader's worker Tuplesortstate complete before the main leader Tuplesortstate even begins, making it very clear that nbtsort.c does not rely on knowing the number of launched workers up-front. That should make Robert a bit happier about our ability to add additional workers fairly late in the process, in a future tuplesort client that finds that to be useful. * I've added a new "parallel" argument to index_build(), which controls whether or not we even call the plan_create_index_workers() cost model. When this is false, we always do a serial build. This was added because I noticed that TRUNCATE REINDEXes the table at a time when parallelism couldn't possibly be useful, and yet it still used parallelism. Best to have the top-level caller opt in or opt out. * Polished the docs some more. * Improved commentary on randomAccess/writable leader handling within logtape.c. We could still support that, if we were willing to make shared BufFiles that are opened within another backend writable. I'm not proposing to do that, but it's nice that we could. I hesitate to post something that won't cleanly apply on the master branch's tip, but otherwise I am ready to send this new revision of the patch right away. It seems likely that Robert will commit your patch within a matter of days, once some minor issues are worked through, at which point I'll send what I have. If anyone prefers, I can post the patch immediately, and break out WaitForParallelWorkersToAttach() as the second patch in a cumulative patch set. Right now, I'm out of things to work on here. Notes on how I've stress-tested parallel CREATE INDEX: I can recommend using the amcheck heapallindexed functionality [1] from the Github version of amcheck to test this patch. You will need to modify the call to IndexBuildHeapScan() that the extension makes, to add a new NULL "scan" argument, since parallel CREATE INDEX changes the signature of IndexBuildHeapScan(). That's trivial, though. Note that parallel CREATE INDEX should produce relfiles that are physically identical to a serial CREATE INDEX, since index tuplesorts are deterministic. IOW, we use a heap TID tie-breaker within tuplesort.c for B-Tree index tuples, which assures us that varying maintenance_work_mem won't affect the final output even in a tiny, insignificant way -- using parallelism should not change anything about the exact output, either. At one point I was testing this patch by verifying not only that indexes were sane, but that they were physically identical to what a serial sort (in the master branch) produced (I only needed to mask page LSNs). Finally, yet another good way to test this patch is to verify that everything continues to work when MAX_PHYSICAL_FILESIZE is modified to be BLCKSZ (2^13 rather than 2^30). You will get many, many BufFile segments that way, which could in theory reveal bugs in rare edge cases that I haven't considered.
This strategy led to my finding a bug in v10 at one point [2], as well as bugs in earlier versions of Thomas' parallel hash join patch set. It has worked for me twice already, so it seems like a good one. It may be worth *combining* with some other stress-testing strategy. [1] https://github.com/petergeoghegan/amcheck#optional-heapallindexed-verification [2] https://www.postgresql.org/message-id/CAM3SWZRWdNtkhiG0GyiX_1mUAypiK3dV6-6542pYe2iEL-foTA@mail.gmail.com -- Peter Geoghegan
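For anyone reproducing that last test, the change amounts to something like the following (against the existing define in buffile.c; a stress-testing hack only, not something to commit):

--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
-#define MAX_PHYSICAL_FILESIZE	0x40000000
+#define MAX_PHYSICAL_FILESIZE	BLCKSZ	/* 2^13: one block per segment */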
On Mon, Jan 29, 2018 at 4:06 PM, Peter Geoghegan <pg@bowt.ie> wrote: > On Sat, Jan 27, 2018 at 12:20 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> I have posted the patch for the above API and posted it on a new >> thread [1]. Do let me know either here or on that thread if the patch >> suffices your need? > > I've responded to you over on that thread. Thanks again for helping me. > > I already have a revision of my patch lined up that is coded to target > your new WaitForParallelWorkersToAttach() interface, plus some other > changes. Attached patch has these changes. -- Peter Geoghegan
Attachment
On Fri, Feb 2, 2018 at 11:16 AM, Peter Geoghegan <pg@bowt.ie> wrote: > Attached patch has these changes. And that patch you attached is also, now, committed. If you could keep an eye on the buildfarm and investigate anything that breaks, I would appreciate it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Feb 2, 2018 at 10:37 AM, Robert Haas <robertmhaas@gmail.com> wrote: > And that patch you attached is also, now, committed. > > If you could keep an eye on the buildfarm and investigate anything > that breaks, I would appreciate it. Fantastic! I can keep an eye on it throughout the day. Thanks everyone -- Peter Geoghegan
On Fri, Feb 2, 2018 at 10:38 AM, Peter Geoghegan <pg@bowt.ie> wrote: > Thanks everyone I would like to acknowledge the assistance of Corey Huinker with early testing of the patch (this took place in 2016, and much of it was not on-list). Even though he wasn't credited in the commit message, he should appear in the V11 release notes reviewer list IMV. His contribution certainly merits it. -- Peter Geoghegan
On Fri, Feb 2, 2018 at 3:23 PM, Peter Geoghegan <pg@bowt.ie> wrote: > On Fri, Feb 2, 2018 at 10:38 AM, Peter Geoghegan <pg@bowt.ie> wrote: >> Thanks everyone > > I would like to acknowledge the assistance of Corey Huinker with early > testing of the patch (this took place in 2016, and much of it was not > on-list). Even though he wasn't credited in the commit message, he > should appear in the V11 release notes reviewer list IMV. His > contribution certainly merits it. For the record, I typically construct the list of reviewers by reading over the thread and adding all the people whose names I find there in chronological order, excluding things that are clearly not review (like "Bumped to next CF.") and opinions on narrow questions that don't indicate that any code-reading or testing was done (like "+1 for calling the GUC foo_bar_baz rather than quux_bletch".) I saw that you copied Corey on the original email, but I see no posts from him on the thread, which is why he didn't get included in the commit message. While I have no problem with him being included in the release notes, I obviously can't know about activity that happens entirely off-list. If you mentioned somewhere in the 200+ messages on this topic that he should be included, I missed that, too. I think it's much harder to give credit adequately when contributions are off-list; letting everyone know what's going on is why we have a list. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Feb 2, 2018 at 12:30 PM, Robert Haas <robertmhaas@gmail.com> wrote: > For the record, I typically construct the list of reviewers by reading > over the thread and adding all the people whose names I find there in > chronological order, excluding things that are clearly not review > (like "Bumped to next CF.") and opinions on narrow questions that > don't indicate that any code-reading or testing was done (like "+1 for > calling the GUC foo_bar_baz rather than quux_bletch".) I saw that you > copied Corey on the original email, but I see no posts from him on the > thread, which is why he didn't get included in the commit message. I did credit him in my own proposed commit message. I know that it's not part of your workflow to preserve that, but I had assumed that that would at least be taken into account. Anyway, mistakes like this happen. I'm glad that we now have the reviewer credit list, so that they can be corrected afterwards. -- Peter Geoghegan
On Fri, Feb 2, 2018 at 3:35 PM, Peter Geoghegan <pg@bowt.ie> wrote: > On Fri, Feb 2, 2018 at 12:30 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> For the record, I typically construct the list of reviewers by reading >> over the thread and adding all the people whose names I find there in >> chronological order, excluding things that are clearly not review >> (like "Bumped to next CF.") and opinions on narrow questions that >> don't indicate that any code-reading or testing was done (like "+1 for >> calling the GUC foo_bar_baz rather than quux_bletch".) I saw that you >> copied Corey on the original email, but I see no posts from him on the >> thread, which is why he didn't get included in the commit message. > > I did credit him in my own proposed commit message. I know that it's > not part of your workflow to preserve that, but I had assumed that > that would at least be taken into account. Ah. Sorry, I didn't look at that. I try to remember to look at proposed commit messages, but not everyone includes them, which is probably part of the reason I don't always remember to look for them. Or maybe I just have failed to adequately develop that habit... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Feb 2, 2018 at 10:38 AM, Peter Geoghegan <pg@bowt.ie> wrote: > On Fri, Feb 2, 2018 at 10:37 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> If you could keep an eye on the buildfarm and investigate anything >> that breaks, I would appreciate it. > I can keep an eye on it throughout the day. There is a benign Valgrind error that causes the lousyjack animal to report failure. It looks like this: ==6850== Syscall param write(buf) points to uninitialised byte(s) ==6850== at 0x4E4D534: write (in /usr/lib64/libpthread-2.26.so) ==6850== by 0x82328F: FileWrite (fd.c:2017) ==6850== by 0x8261AD: BufFileDumpBuffer (buffile.c:513) ==6850== by 0x826569: BufFileFlush (buffile.c:657) ==6850== by 0x8262FB: BufFileRead (buffile.c:561) ==6850== by 0x9F6C79: ltsReadBlock (logtape.c:273) ==6850== by 0x9F7ACF: LogicalTapeFreeze (logtape.c:906) ==6850== by 0xA05B0D: worker_freeze_result_tape (tuplesort.c:4477) ==6850== by 0xA05BC6: worker_nomergeruns (tuplesort.c:4499) ==6850== by 0x9FCA1E: tuplesort_performsort (tuplesort.c:1823) I'll need to go and write a Valgrind suppression for this. I'll get to it later today. -- Peter Geoghegan
On 2018-02-02 13:35:59 -0800, Peter Geoghegan wrote: > On Fri, Feb 2, 2018 at 10:38 AM, Peter Geoghegan <pg@bowt.ie> wrote: > > On Fri, Feb 2, 2018 at 10:37 AM, Robert Haas <robertmhaas@gmail.com> wrote: > >> If you could keep an eye on the buildfarm and investigate anything > >> that breaks, I would appreciate it. > > > I can keep an eye on it throughout the day. > > There is a benign Valgrind error that causes the lousyjack animal to > report failure. It looks like this: > > ==6850== Syscall param write(buf) points to uninitialised byte(s) > ==6850== at 0x4E4D534: write (in /usr/lib64/libpthread-2.26.so) > ==6850== by 0x82328F: FileWrite (fd.c:2017) > ==6850== by 0x8261AD: BufFileDumpBuffer (buffile.c:513) > ==6850== by 0x826569: BufFileFlush (buffile.c:657) > ==6850== by 0x8262FB: BufFileRead (buffile.c:561) > ==6850== by 0x9F6C79: ltsReadBlock (logtape.c:273) > ==6850== by 0x9F7ACF: LogicalTapeFreeze (logtape.c:906) > ==6850== by 0xA05B0D: worker_freeze_result_tape (tuplesort.c:4477) > ==6850== by 0xA05BC6: worker_nomergeruns (tuplesort.c:4499) > ==6850== by 0x9FCA1E: tuplesort_performsort (tuplesort.c:1823) Not saying you're wrong, but you should include a comment on why this is a benign warning. Presumably it's some padding memory somewhere, but it's not obvious from the above bleat. Greetings, Andres Freund
On Fri, Feb 2, 2018 at 1:58 PM, Andres Freund <andres@anarazel.de> wrote:
> Not saying you're wrong, but you should include a comment on why this is
> a benign warning. Presumably it's some padding memory somewhere, but
> it's not obvious from the above bleat.

Sure. This looks slightly more complicated than first anticipated, but
I'll keep everyone posted.

Valgrind suppression aside, this raises another question. The stack
trace shows that the error happens during the creation of a new TOAST
table (CheckAndCreateToastTable()). I wonder if I should also pass
down a flag that makes sure that parallelism is never even attempted
from that path, to match TRUNCATE's suppression of parallel index
builds during its reindexing. It really shouldn't be a problem as
things stand, but maybe it's better to be consistent about "useless"
parallel CREATE INDEX attempts, and suppress them here too.

--
Peter Geoghegan
On Fri, Feb 2, 2018 at 4:31 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Fri, Feb 2, 2018 at 1:58 PM, Andres Freund <andres@anarazel.de> wrote:
>> Not saying you're wrong, but you should include a comment on why this is
>> a benign warning. Presumably it's some padding memory somewhere, but
>> it's not obvious from the above bleat.
>
> Sure. This looks slightly more complicated than first anticipated, but
> I'll keep everyone posted.

I couldn't make up my mind if it was best to prevent the uninitialized
write(), or to instead just add a suppression. I eventually decided
upon the suppression -- see attached patch. My proposed commit message
has a full explanation of the Valgrind issue, which I won't repeat
here. Go read it before reading the rest of this e-mail.

It might seem like my suppression is overly broad, or not broad
enough, since it essentially targets LogicalTapeFreeze(). I don't
think it is, though, because this can occur in two places within
LogicalTapeFreeze() -- it can occur in the place we actually saw the
issue on lousyjack, from the ltsReadBlock() call within
LogicalTapeFreeze(), as well as a second place -- when
BufFileExportShared() is called. I found that you have to tweak code
to prevent it happening in the first place before you'll see it happen
in the second place. I see no point in actually playing whack-a-mole
for a totally benign issue like this, though, which made me finally
decide upon the suppression approach.

Bear in mind that a third way of fixing this would be to allocate
logtape.c buffers using palloc0() rather than palloc() (though I don't
like that idea at all). For serial external sorts, the logtape.c
buffers are guaranteed to have been written to/initialized at least
once as part of spilling a sort to disk. Parallel external sorts don't
quite guarantee that, which is why we run into this Valgrind issue.

--
Peter Geoghegan
Attachment
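For readers who haven't written one: a Valgrind suppression is a stanza
in a .supp file -- PostgreSQL keeps its own in src/tools/valgrind.supp --
that matches an error kind plus a call-stack pattern. A minimal sketch
along the lines Peter describes might look like the following; the
stanza name is invented here, and the frames in the actual patch may
differ:

    {
        uninitialized_logtape_write_sketch
        Memcheck:Param
        write(buf)
        ...
        fun:LogicalTapeFreeze
    }

The "..." frame-level wildcard matches the intermediate fd.c/buffile.c
frames, so the stanza silences only uninitialized write()s that are
reached from LogicalTapeFreeze().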
On Fri, Feb 2, 2018 at 10:26 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> My proposed commit message
> has a full explanation of the Valgrind issue, which I won't repeat
> here. Go read it before reading the rest of this e-mail.

I'm going to paste the first two sentences of your proposed commit
message in here for the convenience of other readers, since I want to
reply to them.

# LogicalTapeFreeze() may write out its first block when it is dirty but
# not full, and then immediately read the first block back in from its
# BufFile as a BLCKSZ-width block. This can only occur in rare cases
# where next to no tuples were written out, which is only possible with
# parallel external tuplesorts.

So, if I understand correctly what you're saying here, valgrind is
totally cool with us writing out an only-partially-initialized block
to a disk file, but it's got a real problem with us reading that data
back into the same memory space it already occupies. That's a little
odd. I presume that it's common for the tail of the final block
written to be uninitialized, but normally when we then go read block
0, that's some other, fully initialized block.

It seems like it would be pretty easy to just suppress the useless
read when we've already got the correct data, and I'd lean toward
going that direction since it's a valid optimization anyway. But I'd
like to hear some opinions from people who use and think about
valgrind more than I do (Tom, Andres, Noah, ...?).

> It might seem like my suppression is overly broad, or not broad
> enough, since it essentially targets LogicalTapeFreeze(). I don't
> think it is, though, because this can occur in two places within
> LogicalTapeFreeze() -- it can occur in the place we actually saw the
> issue on lousyjack, from the ltsReadBlock() call within
> LogicalTapeFreeze(), as well as a second place -- when
> BufFileExportShared() is called. I found that you have to tweak code
> to prevent it happening in the first place before you'll see it happen
> in the second place.

I don't quite see how that would happen, because BufFileExportShared,
at least AFAICS, doesn't touch the buffer?

Unfortunately valgrind does not work at all on my laptop -- the server
appears to start, but as soon as you try to connect, the whole thing
dies with an error claiming that the startup process has failed. So I
can't easily test this at the moment. I'll try to get it working,
here or elsewhere, but thought I'd send the above reply first.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Feb 5, 2018 at 9:43 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> # LogicalTapeFreeze() may write out its first block when it is dirty but
> # not full, and then immediately read the first block back in from its
> # BufFile as a BLCKSZ-width block. This can only occur in rare cases
> # where next to no tuples were written out, which is only possible with
> # parallel external tuplesorts.
>
> So, if I understand correctly what you're saying here, valgrind is
> totally cool with us writing out an only-partially-initialized block
> to a disk file, but it's got a real problem with us reading that data
> back into the same memory space it already occupies.

That's not quite it. Valgrind is cool with a BufFileWrite(), which
doesn't result in an actual write() because the buffile.c stdio-style
buffer (which isn't where the uninitialized bytes originate from)
isn't yet filled. The actual write() comes later, and that's the point
at which Valgrind complains. IOW, Valgrind is cool with copying around
uninitialized memory before we do anything with the underlying values
(e.g., write(), something that affects control flow).

> I presume that it's common for the tail of the final block
> written to be uninitialized, but normally when we then go read block
> 0, that's some other, fully initialized block.

It certainly is common. In the case of logtape.c, we almost always
write out some garbage bytes, even with serial sorts. The only
difference here is the *sense* in which they're garbage: they're
uninitialized bytes, which Valgrind cares about, rather than bytes
from previous writes that are left behind in the buffer, which
Valgrind does not care about.

>> It might seem like my suppression is overly broad, or not broad
>> enough, since it essentially targets LogicalTapeFreeze(). I don't
>> think it is, though, because this can occur in two places within
>> LogicalTapeFreeze() -- it can occur in the place we actually saw the
>> issue on lousyjack, from the ltsReadBlock() call within
>> LogicalTapeFreeze(), as well as a second place -- when
>> BufFileExportShared() is called. I found that you have to tweak code
>> to prevent it happening in the first place before you'll see it happen
>> in the second place.
>
> I don't quite see how that would happen, because BufFileExportShared,
> at least AFAICS, doesn't touch the buffer?

It doesn't have to -- at least not directly. Valgrind remembers that
the uninitialized memory from logtape.c buffers is poisoned -- it
"spreads". The knowledge that the bytes are poisoned is tracked as
they're copied around. You get the error on the write() from the
BufFile buffer, despite the fact that you can make the error go away
by using palloc0() instead of palloc() within logtape.c, and nowhere
else.

--
Peter Geoghegan
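The rule Peter describes -- copying undefined bytes is fine, acting on
them is not -- can be seen in a tiny standalone C sketch (not from the
patch); Memcheck stays quiet on the memcpy() but reports the branch and
the syscall:

    #include <string.h>
    #include <unistd.h>

    int
    main(void)
    {
        char    a[8];       /* never initialized: all 8 bytes undefined */
        char    b[8];

        memcpy(b, a, sizeof(a));    /* no error: undefinedness just propagates */

        if (b[0] == 0)              /* error: branch depends on undefined value */
            return 1;

        write(1, b, sizeof(b));     /* error: undefined bytes passed to write() */
        return 0;
    }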
On Mon, Feb 5, 2018 at 1:03 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> It certainly is common. In the case of logtape.c, we almost always
> write out some garbage bytes, even with serial sorts. The only
> difference here is the *sense* in which they're garbage: they're
> uninitialized bytes, which Valgrind cares about, rather than bytes
> from previous writes that are left behind in the buffer, which
> Valgrind does not care about.

/me face-palms.

So, I guess another option might be to call VALGRIND_MAKE_MEM_DEFINED
on the buffer. "We know what we're doing, trust us!"

In some ways, that seems better than inserting a suppression, because
it only affects the memory in the buffer.

Anybody else want to express an opinion here?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, February 5, 2018 4:27 pm, Robert Haas wrote:
> On Mon, Feb 5, 2018 at 1:03 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>> It certainly is common. In the case of logtape.c, we almost always
>> write out some garbage bytes, even with serial sorts. The only
>> difference here is the *sense* in which they're garbage: they're
>> uninitialized bytes, which Valgrind cares about, rather than bytes
>> from previous writes that are left behind in the buffer, which
>> Valgrind does not care about.
>
> /me face-palms.
>
> So, I guess another option might be to call VALGRIND_MAKE_MEM_DEFINED
> on the buffer. "We know what we're doing, trust us!"
>
> In some ways, that seems better than inserting a suppression, because
> it only affects the memory in the buffer.
>
> Anybody else want to express an opinion here?

Are the uninitialized bytes that are written out "whatever was in the
memory previously" or just "0x00 bytes from the allocation, not yet
overwritten by the PG code"?

Because the first sounds like it could be a security problem -- if
random junk bytes go out to the disk, and stay there, information
could inadvertently leak to permanent storage.

Best regards,

Tels
On Mon, Feb 5, 2018 at 1:27 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Feb 5, 2018 at 1:03 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>> It certainly is common. In the case of logtape.c, we almost always
>> write out some garbage bytes, even with serial sorts. The only
>> difference here is the *sense* in which they're garbage: they're
>> uninitialized bytes, which Valgrind cares about, rather than bytes
>> from previous writes that are left behind in the buffer, which
>> Valgrind does not care about.

I should clarify what I meant here -- it is very common when we have
to freeze a tape, like when we do a serial external randomAccess
tuplesort, or a parallel worker's tuplesort. It shouldn't happen
otherwise. Note that there is a general pattern of dumping out the
current buffer just as the next one is needed, in order to make sure
that the linked list pointer correctly points to the
next/soon-to-be-current block. Note also that the majority of routines
declared within logtape.c can only be used on frozen tapes. I am
pretty confident that I've scoped this correctly by targeting
LogicalTapeFreeze().

> So, I guess another option might be to call VALGRIND_MAKE_MEM_DEFINED
> on the buffer. "We know what we're doing, trust us!"
>
> In some ways, that seems better than inserting a suppression, because
> it only affects the memory in the buffer.

I think that that would also work, and would be simpler, but would
also be slightly inferior to using the proposed suppression. If there
is garbage in logtape.c buffers, we still generally don't want to do
anything important on the basis of those values. We make one exception
with the suppression, which is a pretty typical kind of exception to
make -- don't worry if we write() poisoned bytes, since those are
bound to be alignment related.

OTOH, as I've said, we are generally bound to write some kind of
logtape.c garbage, which will almost certainly not be of the
uninitialized memory variety. So, while I feel that the suppression is
better, the advantage is likely microscopic.

--
Peter Geoghegan
On Mon, Feb 5, 2018 at 1:39 PM, Tels <nospam-abuse@bloodgate.com> wrote:
> Are the uninitialized bytes that are written out "whatever was in the
> memory previously" or just "0x00 bytes from the allocation, not yet
> overwritten by the PG code"?
>
> Because the first sounds like it could be a security problem -- if
> random junk bytes go out to the disk, and stay there, information
> could inadvertently leak to permanent storage.

But you can say the same thing about *any* of the
write()-of-uninitialized-bytes Valgrind suppressions that already
exist. There are quite a few of those. That just isn't part of our
security model.

--
Peter Geoghegan
On Mon, Feb 5, 2018 at 1:45 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>> So, I guess another option might be to call VALGRIND_MAKE_MEM_DEFINED
>> on the buffer. "We know what we're doing, trust us!"
>>
>> In some ways, that seems better than inserting a suppression, because
>> it only affects the memory in the buffer.
>
> I think that that would also work, and would be simpler, but would
> also be slightly inferior to using the proposed suppression. If there
> is garbage in logtape.c buffers, we still generally don't want to do
> anything important on the basis of those values. We make one exception
> with the suppression, which is a pretty typical kind of exception to
> make -- don't worry if we write() poisoned bytes, since those are
> bound to be alignment related.
>
> OTOH, as I've said, we are generally bound to write some kind of
> logtape.c garbage, which will almost certainly not be of the
> uninitialized memory variety. So, while I feel that the suppression is
> better, the advantage is likely microscopic.

Attached patch does it to the tail of the buffer, as Tom suggested on
the -committers thread.

Note that there is one other place in logtape.c that can write a
partial block like this: LogicalTapeRewindForRead(). I haven't
bothered to do anything there, since it cannot possibly be affected by
this issue, for the same reason that serial sorts cannot be -- it's
code that is only used by a tuplesort that really needs to spill to
disk and merge multiple runs (or for tapes that have already been
frozen, which are expected to never reallocate logtape.c buffers).

--
Peter Geoghegan
Attachment
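The attached patch isn't reproduced in the archive, but a rough sketch
of the idea (variable names invented here, not the actual change) would
use the Memcheck client-request wrappers from
src/include/utils/memdebug.h to mark just the uninitialized tail of a
partially-filled block:

    #include "utils/memdebug.h"     /* no-op macros unless USE_VALGRIND is defined */

    /*
     * Sketch only: of the BLCKSZ-sized tape buffer, only 'nbytes' were
     * ever written.  Telling Memcheck that the tail is "defined" keeps
     * the later write() of the whole block from being reported, while
     * leaving every other use of uninitialized memory checked as usual.
     */
    VALGRIND_MAKE_MEM_DEFINED(buffer + nbytes, BLCKSZ - nbytes);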
Robert Haas <robertmhaas@gmail.com> writes:
> Unfortunately valgrind does not work at all on my laptop -- the server
> appears to start, but as soon as you try to connect, the whole thing
> dies with an error claiming that the startup process has failed. So I
> can't easily test this at the moment. I'll try to get it working,
> here or elsewhere, but thought I'd send the above reply first.

Do you want somebody who does have a working valgrind installation
(ie me) to take responsibility for pushing this patch?

			regards, tom lane
On Tue, Feb 6, 2018 at 2:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> Unfortunately valgrind does not work at all on my laptop -- the server
>> appears to start, but as soon as you try to connect, the whole thing
>> dies with an error claiming that the startup process has failed. So I
>> can't easily test this at the moment. I'll try to get it working,
>> here or elsewhere, but thought I'd send the above reply first.
>
> Do you want somebody who does have a working valgrind installation
> (ie me) to take responsibility for pushing this patch?

I committed it before seeing this. It probably would've been better
if you had done it, but I assume Peter tested it, so let's see what
the BF thinks.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Feb 6, 2018 at 12:53 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Do you want somebody who does have a working valgrind installation
>> (ie me) to take responsibility for pushing this patch?
>
> I committed it before seeing this. It probably would've been better
> if you had done it, but I assume Peter tested it, so let's see what
> the BF thinks.

I did test it with a full "make installcheck" + valgrind-3.11.0. I'd
be very surprised if this doesn't make the buildfarm go green.

--
Peter Geoghegan
On 02/06/2018 09:56 PM, Peter Geoghegan wrote:
> On Tue, Feb 6, 2018 at 12:53 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> Do you want somebody who does have a working valgrind installation
>>> (ie me) to take responsibility for pushing this patch?
>>
>> I committed it before seeing this. It probably would've been better
>> if you had done it, but I assume Peter tested it, so let's see what
>> the BF thinks.
>
> I did test it with a full "make installcheck" + valgrind-3.11.0. I'd
> be very surprised if this doesn't make the buildfarm go green.

Did you do a test with "-O0"? In my experience that makes valgrind
tests much more reliable and repeatable. Some time ago we've seen
cases that were failing for me but not for others, and I suspect it
was due to me using "-O0".

(This is more a random comment than a suggestion that your patch won't
make the buildfarm green.)

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Feb 6, 2018 at 1:04 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> Did you do a test with "-O0"? In my experience that makes valgrind tests
> much more reliable and repeatable. Some time ago we've seen cases that
> were failing for me but not for others, and I suspect it was due to me
> using "-O0".

FWIW, I use -O1 when configure is run for Valgrind. I also turn off
assertions (this is all scripted). According to the Valgrind manual:

"With -O1 line numbers in error messages can be inaccurate, although
generally speaking running Memcheck on code compiled at -O1 works
fairly well, and the speed improvement compared to running -O0 is
quite significant. Use of -O2 and above is not recommended as Memcheck
occasionally reports uninitialised-value errors which don’t really
exist."

The manual does also say that there might even be some problems with
-O1 at a later point, but it sounds like it's probably worth it to me.
Skink uses -Og, FWIW.

--
Peter Geoghegan
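For anyone wanting to reproduce this kind of run, one plausible recipe
(a guess at the shape of such a script, not Peter's actual one) is:

    # Build with light optimization and with PostgreSQL's Valgrind
    # client requests compiled in (the code tests #ifdef USE_VALGRIND).
    ./configure --enable-debug CFLAGS="-O1" CPPFLAGS="-DUSE_VALGRIND"
    make -s install

    # Run the server under Memcheck, following forked children and
    # using the project's suppression file.
    valgrind --quiet --trace-children=yes --leak-check=no \
        --suppressions=src/tools/valgrind.supp \
        postgres -D "$PGDATA"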
On 02/06/2018 10:14 PM, Peter Geoghegan wrote:
> On Tue, Feb 6, 2018 at 1:04 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> Did you do a test with "-O0"? In my experience that makes valgrind tests
>> much more reliable and repeatable. Some time ago we've seen cases that
>> were failing for me but not for others, and I suspect it was due to me
>> using "-O0".
>
> FWIW, I use -O1 when configure is run for Valgrind. I also turn off
> assertions (this is all scripted). According to the Valgrind manual:
>
> "With -O1 line numbers in error messages can be inaccurate, although
> generally speaking running Memcheck on code compiled at -O1 works
> fairly well, and the speed improvement compared to running -O0 is
> quite significant. Use of -O2 and above is not recommended as Memcheck
> occasionally reports uninitialised-value errors which don’t really
> exist."

OK, although I was suggesting that the optimizations may actually have
the opposite effect -- valgrind missing some of the invalid memory
accesses (until the compiler decides not to use them for some reason,
causing sudden valgrind failures).

> The manual does also say that there might even be some problems with
> -O1 at a later point, but it sounds like it's probably worth it to me.
> Skink uses -Og, FWIW.

I have little idea what -Og exactly means. It seems to be focused on
debugging experience, and so still does some of the optimizations.
Which I think would explain why skink was not detecting some of the
failures for a long time.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Feb 6, 2018 at 1:30 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> I have little idea what -Og exactly means. It seems to be focused on
> debugging experience, and so still does some of the optimizations.

As I understand it, -Og allows any optimization that does not hamper
walking through code with a debugger.

> Which
> I think would explain why skink was not detecting some of the failures
> for a long time.

I think that skink didn't detect failures until now because the code
wasn't exercised until parallel CREATE INDEX was added, simply because
the function LogicalTapeFreeze() was never reached (though that's not
the only reason, it is the most obvious one).

--
Peter Geoghegan
On 02/06/2018 10:39 PM, Peter Geoghegan wrote:
> On Tue, Feb 6, 2018 at 1:30 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> I have little idea what -Og exactly means. It seems to be focused on
>> debugging experience, and so still does some of the optimizations.
>
> As I understand it, -Og allows any optimization that does not hamper
> walking through code with a debugger.
>
>> Which
>> I think would explain why skink was not detecting some of the failures
>> for a long time.
>
> I think that skink didn't detect failures until now because the code
> wasn't exercised until parallel CREATE INDEX was added, simply because
> the function LogicalTapeFreeze() was never reached (though that's not
> the only reason, it is the most obvious one).

Maybe. What I had in mind was a different thread from November,
discussing some non-deterministic valgrind failures:

https://www.postgresql.org/message-id/flat/20171125200014.qbewtip5oydqsklt%40alap3.anarazel.de#20171125200014.qbewtip5oydqsklt@alap3.anarazel.de

But you're right that may be irrelevant here. As I said, it was mostly
just a random comment about valgrind.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Feb 6, 2018 at 3:53 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Feb 6, 2018 at 2:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Robert Haas <robertmhaas@gmail.com> writes:
>>> Unfortunately valgrind does not work at all on my laptop -- the server
>>> appears to start, but as soon as you try to connect, the whole thing
>>> dies with an error claiming that the startup process has failed. So I
>>> can't easily test this at the moment. I'll try to get it working,
>>> here or elsewhere, but thought I'd send the above reply first.
>>
>> Do you want somebody who does have a working valgrind installation
>> (ie me) to take responsibility for pushing this patch?
>
> I committed it before seeing this. It probably would've been better
> if you had done it, but I assume Peter tested it, so let's see what
> the BF thinks.

skink and lousyjack seem happy now, so I think it worked.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi all,
While testing this feature I found a crash on PG head with parallel CREATE INDEX using pgbench tables.
-- GUCs under postgresql.conf
max_parallel_maintenance_workers = 16
max_parallel_workers = 16
max_parallel_workers_per_gather = 8
maintenance_work_mem = 8GB
max_wal_size = 4GB
./pgbench -i -s 500 -d postgres
postgres=# create index pgb_acc_idx3 on pgbench_accounts(aid, abalance,filler);
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!>
--
With Regards,
Prabhat Kumar Sahu
Skype ID: prabhat.sahu1984
EnterpriseDB Corporation
The Postgres Database Company
On Wed, Mar 7, 2018 at 8:13 AM, Prabhat Sahu <prabhat.sahu@enterprisedb.com> wrote:
> Hi all,
>
> While testing this feature I found a crash on PG head with parallel
> CREATE INDEX using pgbench tables.
>
> -- GUCs under postgresql.conf
> max_parallel_maintenance_workers = 16
> max_parallel_workers = 16
> max_parallel_workers_per_gather = 8
> maintenance_work_mem = 8GB
> max_wal_size = 4GB
>
> ./pgbench -i -s 500 -d postgres
>
> postgres=# create index pgb_acc_idx3 on pgbench_accounts(aid, abalance,filler);
> WARNING:  terminating connection because of crash of another server process
> DETAIL:  The postmaster has commanded this server process to roll back the
> current transaction and exit, because another server process exited
> abnormally and possibly corrupted shared memory.
> [...]
> The connection to the server was lost. Attempting reset: Failed.
That makes it look like perhaps one of the worker backends crashed. Did you get a message in the logfile that might indicate the nature of the crash? Something with PANIC or TRAP, perhaps?
On Wed, Mar 7, 2018 at 7:16 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Mar 7, 2018 at 8:13 AM, Prabhat Sahu
> <prabhat.sahu@enterprisedb.com> wrote:
>> While testing this feature I found a crash on PG head with parallel
>> CREATE INDEX using pgbench tables. [...]
>
> That makes it look like perhaps one of the worker backends crashed. Did
> you get a message in the logfile that might indicate the nature of the
> crash? Something with PANIC or TRAP, perhaps?
I am not able to see any PANIC/TRAP in the log file.
Here are the contents.
[edb@localhost bin]$ cat logsnew
2018-03-07 19:21:20.922 IST [54400] LOG: listening on IPv6 address "::1", port 5432
2018-03-07 19:21:20.922 IST [54400] LOG: listening on IPv4 address "127.0.0.1", port 5432
2018-03-07 19:21:20.925 IST [54400] LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
2018-03-07 19:21:20.936 IST [54401] LOG: database system was shut down at 2018-03-07 19:21:20 IST
2018-03-07 19:21:20.939 IST [54400] LOG: database system is ready to accept connections
2018-03-07 19:24:44.263 IST [54400] LOG: background worker "parallel worker" (PID 54482) was terminated by signal 9: Killed
2018-03-07 19:24:44.286 IST [54400] LOG: terminating any other active server processes
2018-03-07 19:24:44.297 IST [54405] WARNING: terminating connection because of crash of another server process
2018-03-07 19:24:44.297 IST [54405] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2018-03-07 19:24:44.297 IST [54405] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2018-03-07 19:24:44.301 IST [54478] WARNING: terminating connection because of crash of another server process
2018-03-07 19:24:44.301 IST [54478] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2018-03-07 19:24:44.301 IST [54478] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2018-03-07 19:24:44.494 IST [54504] FATAL: the database system is in recovery mode
2018-03-07 19:24:44.496 IST [54400] LOG: all server processes terminated; reinitializing
2018-03-07 19:24:44.513 IST [54505] LOG: database system was interrupted; last known up at 2018-03-07 19:22:54 IST
2018-03-07 19:24:44.552 IST [54505] LOG: database system was not properly shut down; automatic recovery in progress
2018-03-07 19:24:44.554 IST [54505] LOG: redo starts at 0/AB401A38
2018-03-07 19:25:14.712 IST [54505] LOG: invalid record length at 1/818B8D80: wanted 24, got 0
2018-03-07 19:25:14.714 IST [54505] LOG: redo done at 1/818B8D48
2018-03-07 19:25:14.714 IST [54505] LOG: last completed transaction was at log time 2018-03-07 19:24:05.322402+05:30
2018-03-07 19:25:16.887 IST [54400] LOG: database system is ready to accept connections
--
With Regards,
Prabhat Kumar Sahu
Skype ID: prabhat.sahu1984
EnterpriseDB Corporation
The Postgres Database Company
On Wed, Mar 7, 2018 at 8:59 AM, Prabhat Sahu <prabhat.sahu@enterprisedb.com> wrote:
> 2018-03-07 19:24:44.263 IST [54400] LOG:  background worker "parallel
> worker" (PID 54482) was terminated by signal 9: Killed
That looks like the background worker got killed by the OOM killer. How much memory do you have in the machine where this occurred?
On 03/07/2018 03:21 PM, Robert Haas wrote:
> On Wed, Mar 7, 2018 at 8:59 AM, Prabhat Sahu
> <prabhat.sahu@enterprisedb.com> wrote:
>
>     2018-03-07 19:24:44.263 IST [54400] LOG:  background worker
>     "parallel worker" (PID 54482) was terminated by signal 9: Killed
>
> That looks like the background worker got killed by the OOM killer.
> How much memory do you have in the machine where this occurred?

FWIW that's usually written to the system log. Does dmesg say
something about the kill?

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
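For reference, one way to check for an OOM kill on Linux (the exact
message wording varies across kernel versions):

    dmesg | grep -iE 'out of memory|oom-killer|killed process'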
On Wed, Mar 7, 2018 at 5:16 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> FWIW that's usually written to the system log. Does dmesg say something
> about the kill?

While it would be nice to confirm that it was indeed the OOM killer,
either way the crash happened because SIGKILL was sent to a parallel
worker. There is no reason to suspect a bug.

--
Peter Geoghegan
On March 7, 2018 5:40:18 PM PST, Peter Geoghegan <pg@bowt.ie> wrote:
> On Wed, Mar 7, 2018 at 5:16 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> FWIW that's usually written to the system log. Does dmesg say something
>> about the kill?
>
> While it would be nice to confirm that it was indeed the OOM killer,
> either way the crash happened because SIGKILL was sent to a parallel
> worker. There is no reason to suspect a bug.

Not impossible there's a leak somewhere though.

Andres

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Wed, Mar 7, 2018 at 7:51 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Mar 7, 2018 at 8:59 AM, Prabhat Sahu
> <prabhat.sahu@enterprisedb.com> wrote:
>> 2018-03-07 19:24:44.263 IST [54400] LOG:  background worker "parallel
>> worker" (PID 54482) was terminated by signal 9: Killed
>
> That looks like the background worker got killed by the OOM killer. How
> much memory do you have in the machine where this occurred?
I have run the testcase on my local machine with the below configuration:
Environment: CentOS 7 (64-bit)
HD: 100GB
RAM: 4GB
Processors: 4
I have narrowed down the testcase as below, which also reproduces the same crash.
-- GUCs under postgresql.conf
maintenance_work_mem = 8GB
./pgbench -i -s 500 -d postgres
postgres=# create index pgb_acc_idx3 on pgbench_accounts(aid, abalance,filler);
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!>
--
With Regards,
Prabhat Kumar Sahu
Skype ID: prabhat.sahu1984
EnterpriseDB Corporation
The Postgres Database Company
Prabhat Sahu <prabhat.sahu@enterprisedb.com> writes:
> On Wed, Mar 7, 2018 at 7:51 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> That looks like the background worker got killed by the OOM killer. How
>> much memory do you have in the machine where this occurred?

> I have run the testcase on my local machine with the below configuration:
> Environment: CentOS 7 (64-bit)
> HD: 100GB
> RAM: 4GB
> Processors: 4

If you only have 4GB of physical RAM, it hardly seems surprising that
trying to use 8GB of maintenance_work_mem would draw the wrath of the
OOM killer.

			regards, tom lane
On Thu, Mar 8, 2018 at 11:45 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Prabhat Sahu <prabhat.sahu@enterprisedb.com> writes:
>> On Wed, Mar 7, 2018 at 7:51 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> That looks like the background worker got killed by the OOM killer. How
>>> much memory do you have in the machine where this occurred?
>
>> I have run the testcase on my local machine with the below configuration:
>> Environment: CentOS 7 (64-bit)
>> HD: 100GB
>> RAM: 4GB
>> Processors: 4
>
> If you only have 4GB of physical RAM, it hardly seems surprising that
> trying to use 8GB of maintenance_work_mem would draw the wrath of the
> OOM killer.

Yup.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
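For anyone re-running Prabhat's test on a similarly small machine, the
obvious fix is to keep the sort memory budget well under physical RAM.
Something like the following should avoid the OOM kill (the exact value
is a judgment call; note that the leader and workers of a parallel
CREATE INDEX share the maintenance_work_mem budget):

    SET maintenance_work_mem = '512MB';
    CREATE INDEX pgb_acc_idx3 ON pgbench_accounts (aid, abalance, filler);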