Thread: Parallel tuplesort (for parallel B-Tree index creation)

Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
As some of you know, I've been working on parallel sort. I think I've
gone as long as I can without feedback on the design (and I see that
we're accepting stuff for September CF now), so I'd like to share what
I came up with. This project is something that I've worked on
inconsistently since late last year. It can be thought of as the
Postgres 10 follow-up to the 9.6 work on external sorting.

Attached WIP patch series:

* Adds a parallel sorting capability to tuplesort.c.

* Adds a new client of this capability: btbuild()/nbtsort.c can now
create B-Trees in parallel.

Most of the complexity here relates to the first item; the tuplesort
module has been extended to support sorting in parallel. This is
usable in principle by every existing tuplesort caller, without any
restriction imposed by the newly expanded tuplesort.h interface. So,
for example, randomAccess MinimalTuple support has been added,
although it goes unused for now.

I went with CREATE INDEX as the first client of parallel sort in part
because the cost model and so on can be relatively straightforward.
Even CLUSTER uses the optimizer to determine if a sort strategy is
appropriate, and that would need to be taught about parallelism if its
tuplesort is to be parallelized. I suppose that I'll probably try to
get CLUSTER (with a tuplesort) done in the Postgres 10 development
cycle too, but not just yet.

For now, I would prefer to focus discussion on tuplesort itself. If
you can only look at one part of this patch, please look at the
high-level description of the interface/caller contract that was added
to tuplesort.h.

Performance
===========

Without further ado, I'll demonstrate how the patch series improves
performance in one case. This benchmark was run on an AWS server with
many disks. A d2.4xlarge instance was used, with 16 vCPUs, 122 GiB
RAM, 12 x 2 TB HDDs, running Amazon Linux. Apparently, this AWS
instance type can sustain 1,750 MB/second of I/O, which I was able to
verify during testing (when a parallel sequential scan ran, iotop
reported read throughput slightly above that for multi-second bursts).
Disks were configured in software RAID0. These instances have disks
that are optimized for sequential performance, which suits the patch
quite well. I don't usually trust AWS EC2 for performance testing, but
it seemed to work well here (results were pretty consistent).

Setup:

CREATE TABLE parallel_sort_test AS
    SELECT hashint8(i) randint,
    md5(i::text) collate "C" padding1,
    md5(i::text || '2') collate "C" padding2
    FROM generate_series(0, 1e9::bigint) i;

CHECKPOINT;

This leaves us with a parallel_sort_test table that is 94 GB in size.

SET maintenance_work_mem = '8GB';

-- Serial case (external sort, should closely match master branch):
CREATE INDEX serial_idx ON parallel_sort_test (randint) WITH
(parallel_workers = 0);

Total time: 00:15:42.15

-- Patch with 8 tuplesort "sort-and-scan" workers (leader process
participates as a worker here):
CREATE INDEX patch_8_idx ON parallel_sort_test (randint) WITH
(parallel_workers = 7);

Total time: 00:06:03.86

As you can see, the parallel case is 2.58x faster (while using more
memory, though it's not the case that a higher maintenance_work_mem
setting speeds up the serial/baseline index build). 8 workers are a
bit faster than 4, but not by much (not shown). 16 are a bit slower,
but not by much (not shown).

trace_sort output for "serial_idx" case:
"""
begin index sort: unique = f, workMem = 8388608, randomAccess = f
switching to external sort with 501 tapes: CPU 7.81s/25.54u sec
elapsed 33.95 sec
*** SNIP ***
performsort done (except 7-way final merge): CPU 53.52s/666.89u sec
elapsed 731.67 sec
external sort ended, 2443786 disk blocks used: CPU 74.40s/854.52u sec
elapsed 942.15 sec
"""

trace_sort output for "patch_8_idx" case:
"""
begin index sort: unique = f, workMem = 8388608, randomAccess = f
*** SNIP ***
sized memtuples 1.62x from worker's 130254158 (3052832 KB) to
210895910 (4942873 KB) for leader merge (0 KB batch memory conserved)
*** SNIP ***
tape -1/7 initially used 411907 KB of 430693 KB batch (0.956) and
26361986 out of 26361987 slots (1.000)
performsort done (except 8-way final merge): CPU 12.28s/101.76u sec
elapsed 129.01 sec
parallel external sort ended, 2443805 disk blocks used: CPU
30.08s/318.15u sec elapsed 363.86 sec
"""

This is roughly the degree of improvement that I expected when I first
undertook this project late last year. As I discuss in more detail
below, I believe that we haven't exhausted all avenues to make
parallel CREATE INDEX faster still, but I do think what's left on the
table is not enormous.

There is less benefit when sorting on a C locale text attribute,
because the overhead of merging dominates parallel sorts, and that's
even more pronounced with text. So, many text cases tend to work out
at about only 2x - 2.2x faster. We could work on this indirectly.

I've seen cases where a CREATE INDEX ended up more than 3x faster,
though. I benchmarked this case in the interest of simplicity (the
serial case is intended to be comparable, making the test fair).
Encouragingly, as you can see from the trace_sort output, the 8
parallel workers are 5.67x faster at getting to the final merge (a
merge that even the parallel case performs serially). Note that the
final merge for each CREATE INDEX is comparable (7 runs in the serial
case vs. 8 runs, one from each of 8 workers). Not bad!

Design: New, key concepts for tuplesort.c
=========================================

The heap is scanned in parallel, and worker processes also merge in
parallel if required (it isn't required in the example above). The
implementation makes heavy use of existing external sort
infrastructure. In fact, it's almost the case that the implementation
is a generalization of external sorting that allows workers to perform
heap scanning and run sorting independently, with tapes then "unified"
in the leader process for merging. At that point, the state held by
the leader is more or less consistent with the leader being a serial
external sort process that has reached its merge phase in the
conventional manner (serially).

The steps callers must take are described fully in tuplesort.h. The
general idea is that a Tuplesortstate is aware that it might not be a
self-contained sort; it may instead be one part of a parallel sort
operation. You might say that the tuplesort caller must "build its own
sort" from participant worker process Tuplesortstates. The caller
creates a dynamic shared memory segment + TOC for each parallel sort
operation (could be more than one concurrent sort operation, of
course), passes that to tuplesort to initialize and manage, and
creates a "leader" Tuplesortstate in private memory, plus one or more
"worker" Tuplesortstates, each presumably managed by a different
parallel worker process.

tuplesort.c does most of the heavy lifting, including having processes
wait on each other to respect its ordering dependencies. Caller is
responsible for spawning workers to do the work, reporting details of
the workers to tuplesort through shared memory, and having workers
call tuplesort to actually perform sorting. Caller consumes final
output through leader Tuplesortstate in leader process.

I think that this division of labor works well for us.
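To make that division of labor concrete, here is a rough sketch of the
flow, with argument lists elided and several helper names
(launch_sort_workers() and so on) invented for this e-mail only --
tuplesort.h has the real contract:

/* Leader backend (illustrative sketch; invented helper names): */
seg = dsm_create(...);                  /* caller-created shmem + TOC,
                                         * initialized and managed by
                                         * tuplesort thereafter        */
launch_sort_workers(seg, nworkers);     /* caller's responsibility     */
leader = tuplesort_begin_index_btree(...);  /* "leader" Tuplesortstate */
tuplesort_performsort(leader);          /* waits for workers, unifies
                                         * their tapes, merges on the
                                         * fly                         */
while (tuplesort_getindextuple(leader, ...) != NULL)
    ;                                   /* consume final sorted output */
tuplesort_end(leader);

/* Each worker backend, in a caller-supplied entry point: */
worker = tuplesort_begin_index_btree(...);  /* "worker" Tuplesortstate */
while (parallel_scan_next_tuple(...))       /* parallel heap scan      */
    tuplesort_putindextuplevalues(worker, ...);
tuplesort_performsort(worker);          /* sort; leave one materialized
                                         * run on tape for the leader  */
tuplesort_end(worker);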

Tape unification
----------------

Sort operations have a unique identifier, generated before any workers
are launched, using a scheme based on the leader's PID and a unique
temp file number. This makes all on-disk state (temp files managed by
logtape.c) discoverable by the leader process. State in shared memory
is sized in proportion to the number of workers, so the only thing
about the data being sorted that gets passed around in shared memory
is a little logtape.c metadata for tapes, describing for example how
large each constituent BufFile is (a BufFile associated with one
particular worker's tapeset).
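
To illustrate (the names and exact format here are made up; the real
scheme is a patch implementation detail), a worker's temp files can be
located with nothing more than:

/* Hypothetical naming sketch; not the patch's literal format */
snprintf(path, MAXPGPATH, "base/pgsql_tmp/pgsql_tmp%d.%d.%d",
         leaderPid,     /* leader's PID, fixed before workers launch */
         sortId,        /* per-leader unique temp file number        */
         workerNum);    /* identifies one worker's tapeset           */

Given only that identifier and the number of launched workers, the
leader can enumerate and open every worker's files.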

(See below also for notes on buffile.c's role in all of this, fd.c and
resource management, etc.)

workMem
-------

Each worker process claims workMem as if it were an independent node.

The new implementation reuses much of what was originally designed for
external sorts. As such, parallel sorts are necessarily external
sorts, even when the workMem (i.e. maintenance_work_mem) budget could
in principle allow for parallel sorting to take place entirely in
memory. The implementation arguably *insists* on making such cases
external sorts, when they don't really need to be. This is much less
of a problem than you might think, since the 9.6 work on external
sorting does somewhat blur the distinction between internal and
external sorts (just consider how much time trace_sort indicates is
spent waiting on writes in workers; it's typically a small part of the
total time spent). Since parallel sort is really only compelling for
large sorts, it makes sense to make them external, or at least to
prioritize the cases that should be performed externally.

Anyway, workMem-not-exceeded cases require special handling to avoid
completely wasting memory. Statistics about worker observations are
used at later stages, at the very least to avoid blatant waste, and
more generally to keep memory use as close to optimal as possible.

Merging
=======

The model that I've come up with is that every worker process is
guaranteed to output one materialized run onto one tape for the leader
to merge within its "unified" tapeset. This is the case
regardless of how much workMem is available, or any other factor. The
leader always assumes that the worker runs/tapes are present and
discoverable based only on the number of known-launched worker
processes, and a little metadata on each that is passed through shared
memory.

Producing one output run/materialized tape from all input tuples in a
worker often happens without the worker running out of workMem, which
you saw above. A straight quicksort and dump of all tuples is
therefore possible, without any merging required in the worker.
Alternatively, it may prove necessary to do some amount of merging in
each worker to generate one materialized output run. This case is
handled in the same way as a randomAccess case that requires one
materialized output tape to support random access by the caller. This
worker merging does necessitate another pass over all temp files for
the worker, but that's a much lower cost than you might imagine, in
part because the newly expanded use of batch memory makes merging here
cache efficient.
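
Schematically, a worker's endgame looks something like this (a sketch
with invented helper names, not the patch's actual control flow):

/* Sketch only; helper names invented for illustration */
if (never_ran_out_of_workMem(worker))
{
    /* A straight quicksort and dump of all tuples suffices to
     * produce a single materialized run */
    quicksort_all_tuples(worker);
    dump_to_single_tape(worker);
}
else
{
    /* Spilled: merge the worker's own runs down to exactly one
     * materialized tape, just as a randomAccess caller would */
    merge_own_runs_to_single_tape(worker);
}
/* Either way, the leader may assume one run per worker */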

Batch allocation is used for all merging involved here, not just the
leader's own final-on-the-fly merge, so merging is consistently cache
efficient. (Workers that must merge on their own are therefore similar
to traditional randomAccess callers, so these cases become important
enough to optimize with the batch memory patch, although that's still
independently useful.)

No merging in parallel
----------------------

Currently, merging worker *output* runs may only occur in the leader
process. In other words, we always keep n worker processes busy with
scanning-and-sorting (and maybe some merging), but then all processes
but the leader process grind to a halt (note that the leader process
can participate as a scan-and-sort tuplesort worker, just as it will
everywhere else, which is why I specified "parallel_workers = 7" but
talked about 8 workers).

One leader process is kept busy with merging these n output runs on
the fly, so things will bottleneck on that, which you saw in the
example above. As already described, workers will sometimes merge in
parallel, but only their own runs -- never another worker's runs. I
did attempt to address the leader merge bottleneck by implementing
cross-worker run merging in workers. I got as far as implementing a
very rough version of this, but initial results were disappointing,
and so that was not pursued further than the experimentation stage.

Parallel merging is a possible future improvement that could be added
to what I've come up with, but I don't think that it will move the
needle in a really noticeable way.

Partitioning for parallelism (samplesort style "bucketing")
-----------------------------------------------------------

Perhaps a partition-based approach would be more effective than
parallel merging (e.g., redistribute slices of worker runs across
workers along predetermined partition boundaries, sort a range of
values within dedicated workers, then concatenate to get final result,
a bit like the in-memory samplesort algorithm). That approach would
not suit CREATE INDEX, because the approach's great strength is that
the workers can run in parallel for the entire duration, since there
is no merge bottleneck (this assumes good partition boundaries, which
is a bit of a risky assumption). For partitioning to pay off, parallel
CREATE INDEX would need workers that can independently write the index,
independently WAL-log, and independently create a unified set of
internal pages, all of which is hard.

This patch series will tend to proportionally speed up CREATE INDEX
statements at a level that is comparable to other major database
systems. That's enough progress for one release. I think that
partitioning to sort is more useful for query execution than for
utility statements like CREATE INDEX.

Partitioning and merge joins
----------------------------

Robert has often speculated about what it would take to make merge
joins work well in parallel. I think that "range
distribution"/bucketing will prove an important component of that.
It's just too useful to aggregate tuples in shared memory initially,
and have workers sort them without any serial merge bottleneck;
arguments about misestimations, data skew, and so on should not deter
us from this, long term. This approach has minimal IPC overhead,
especially with regard to LWLock contention.

This kind of redistribution probably belongs in a Gather-like node,
though, which has access to the context necessary to determine a
range, and even dynamically alter the range in the event of a
misestimation. Under this scheme, tuplesort.c just needs to be
instructed that these worker-private Tuplesortstates are
range-partitioned (i.e., the sorts are virtually independent, as far
as it's concerned). That's a bit messy, but it is still probably the
way to go for merge joins and other sort-reliant executor nodes.

buffile.c, and "unification"
============================

There has been significant new infrastructure added to make logtape.c
aware of workers. buffile.c has in turn been taught about unification
as a first-class part of the abstraction, with low-level management of
certain details occurring within fd.c. So, "tape unification" is added:
a process can open other backends' logical tapes, generating a unified
logical tapeset for the leader to merge (see the conceptual sketch
after the list below). This is probably the single biggest source of
complexity for the patch, since I must consider:

* Creating a general, reusable abstraction for other possible BufFile
users (logtape.c only has to serve tuplesort.c, though).

* Logical tape free space management.

* Resource management, file lifetime, etc. fd.c resource management
can now close a file at xact end for temp files, while not deleting it
in the leader backend (only the "owning" worker backend deletes the
temp file it owns).

* Crash safety (e.g., when to truncate existing temp files, and when not to).
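
The conceptual shape of the unified view is roughly this (a sketch with
invented names, shown only to illustrate; the patch's actual structures
differ):

/* Illustrative sketch only -- invented names, not the patch's code */
typedef struct UnifiedTapeSet
{
    int       nworkers;   /* number of participant worker tapesets   */
    BufFile **file;       /* worker i's BufFile, opened read-only by
                           * name in the leader, which closes but
                           * never deletes it                         */
    int64    *nblocks;    /* worker i's file size: essentially the
                           * only metadata passed via shared memory   */
} UnifiedTapeSet;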

CREATE INDEX user interface
===========================

There are two ways to determine how many parallel workers a CREATE
INDEX requests:

* A cost model, which is closely based on create_plain_partial_paths()
at the moment. This needs more work, particularly to model things like
maintenance_work_mem. Even so, it isn't terrible.

* A parallel_workers storage parameter, which completely bypasses the
cost model. This is the "DBA knows best" approach, and is what I've
consistently used during testing.

Corey Huinker has privately assisted me with performance testing the
patch, using his own datasets. Testing has exclusively used the
storage parameter.

I've added a new GUC, max_parallel_workers_maintenance, which is
essentially the utility statement equivalent of
max_parallel_workers_per_gather. This is clearly necessary, since
we're using up to maintenance_work_mem per worker, which is of course
typically much higher than work_mem. I didn't feel the need to create
new maintenance variants of GUCs like
min_parallel_relation_size, though. Only this one new GUC is added
(plus the new storage parameter, parallel_workers, not to be confused
with the existing table storage parameter of the same name).

I am much more concerned about the tuplesort.h interface than the
CREATE INDEX user interface as such. The user interface is merely a
facade on top of tuplesort.c and nbtsort.c (and not one that I'm
particularly attached to).

--
Peter Geoghegan


Re: Parallel tuplesort (for parallel B-Tree index creation)

From
Robert Haas
Date:
On Mon, Aug 1, 2016 at 6:18 PM, Peter Geoghegan <pg@heroku.com> wrote:
> As some of you know, I've been working on parallel sort. I think I've
> gone as long as I can without feedback on the design (and I see that
> we're accepting stuff for September CF now), so I'd like to share what
> I came up with. This project is something that I've worked on
> inconsistently since late last year. It can be thought of as the
> Postgres 10 follow-up to the 9.6 work on external sorting.

I am glad that you are working on this.

Just a first thought after reading the email:

> As you can see, the parallel case is 2.58x faster (while using more
> memory, though it's not the case that a higher maintenance_work_mem
> setting speeds up the serial/baseline index build). 8 workers are a
> bit faster than 4, but not by much (not shown). 16 are a bit slower,
> but not by much (not shown).
...
> I've seen cases where a CREATE INDEX ended up more than 3x faster,
> though. I benchmarked this case in the interest of simplicity (the
> serial case is intended to be comparable, making the test fair).
> Encouragingly, as you can see from the trace_sort output, the 8
> parallel workers are 5.67x faster at getting to the final merge (a
> merge that even it performs serially). Note that the final merge for
> each CREATE INDEX is comparable (7 runs vs. 8 runs from each of 8
> workers). Not bad!

I'm not going to say it's bad to be able to do things 2-2.5x faster,
but linear scalability this ain't - particularly because your 2.58x
faster case is using up to 7 or 8 times as much memory.  The
single-process case would be faster in that case, too: you could
quicksort.  I feel like for sorting, in particular, we probably ought
to be setting the total memory budget, not the per-process memory
budget.  Or if not, then any CREATE INDEX benchmarking had better
compare using scaled values for maintenance_work_mem; otherwise,
you're measuring the impact of using more memory as much as anything
else.

I also think that Amdahl's law is going to pinch pretty severely here.
If the final merge phase is a significant percentage of the total
runtime, picking an algorithm that can't parallelize the final merge
is going to limit the speedups to small multiples.  That's an OK place
to be as a result of not having done all the work yet, but you don't
want to get locked into it.  If we're going to have a substantial
portion of the work that can never be parallelized, maybe we've picked
the wrong algorithm.

The work on making the logtape infrastructure parallel-aware seems
very interesting and potentially useful for other things.  Sadly, I
don't have time to look at it right now.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Wed, Aug 3, 2016 at 11:42 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> I'm not going to say it's bad to be able to do things 2-2.5x faster,
> but linear scalability this ain't - particularly because your 2.58x
> faster case is using up to 7 or 8 times as much memory.  The
> single-process case would be faster in that case, too: you could
> quicksort.

Certainly, there are cases where a parallel version could benefit more
from having extra memory than from actually parallelizing the
underlying task. However, this case was pointedly chosen to *not* be
such a case.
When maintenance_work_mem exceeds about 5GB, I've observed that since
9.6 increasing it is just as likely to hurt as to help by about +/-5%
(unless and until it's all in memory, which still doesn't help much).
In general, there isn't all that much point in doing a very large sort
like this in memory. You just don't get that much of a benefit for the
memory you use, because linearithmic CPU costs eventually really
dominate linear sequential I/O costs.
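(Rough arithmetic, under deliberately simple cost models: sorting n =
10^9 tuples takes on the order of n * log2(n) comparisons, or about 3 *
10^10, while reading and writing the data a fixed number of times is
only O(n) sequential I/O. The comparison term grows strictly faster
with n, so memory that merely keeps the sort internal buys relatively
little at this scale.)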

I think you're focusing on the fact that there is a large absolute
disparity in memory used in this one benchmark, but that isn't
something that the gains shown particularly hinge upon. There isn't
that much difference when workers must merge their own runs, for
example. It saves the serial leader merge some work, and in particular
makes it more cache efficient (by having fewer runs/tapes).

Finally, while about 8x as much memory is used, the memory used over
and above the serial case is almost all freed when the final merge
begins (the final merges are therefore very similar in both cases,
including in terms of memory use). So, for as long as you use 8x as
much memory for 8 active processes, you get a 5.67x speed-up of that
part alone. You still keep a few extra KiBs of memory for worker tapes
and things like that during the leader's merge, but that's a nearly
negligible amount.

> I feel like for sorting, in particular, we probably ought
> to be setting the total memory budget, not the per-process memory
> budget.  Or if not, then any CREATE INDEX benchmarking had better
> compare using scaled values for maintenance_work_mem; otherwise,
> you're measuring the impact of using more memory as much as anything
> else.

As I said, the benchmark was chosen to avoid that (and to be simple
and reproducible). I am currently neutral on the question of whether
or not maintenance_work_mem should be doled out per process or per
sort operation. I do think that making it a per-process allowance is
far closer to what we do for hash joins today, and is simpler.

What's nice about the idea of making the workMem/maintenance_work_mem
budget per sort is that it leaves the leader process with license to
greatly increase the amount of memory it can use for the merge.
Increasing the amount of memory used for the merge keeps paying off
well past the point where extra memory stops helping the workers. I've
simulated it already.

> I also think that Amdahl's law is going to pinch pretty severely here.

Doesn't that almost always happen, though? Isn't that what you
generally see with queries that show off the parallel join capability?

> If the final merge phase is a significant percentage of the total
> runtime, picking an algorithm that can't parallelize the final merge
> is going to limit the speedups to small multiples.  That's an OK place
> to be as a result of not having done all the work yet, but you don't
> want to get locked into it.  If we're going to have a substantial
> portion of the work that can never be parallelized, maybe we've picked
> the wrong algorithm.

I suggest that this work be compared to something with similar
constraints. I used Google to try to get some indication of how much
of a difference parallel CREATE INDEX makes in other major database
systems. This is all I could find:

https://www.mssqltips.com/sqlservertip/3100/reduce-time-for-sql-server-index-rebuilds-and-update-statistics/

It seems like the degree of parallelism used for SQL Server tends to
affect index build time in a way that is strikingly similar to what
I've come up with (which may be a coincidence; I don't know anything
about SQL Server). So, I suspect that the performance of this is
fairly good in an apples-to-apples comparison.

Parallelizing merging can hurt or help, because there is a cost in
memory bandwidth (if not I/O) for the extra passes that are used to
keep more CPUs busy, which is kind of analogous to the situation with
polyphase merge. I'm not saying that we shouldn't do that even still,
but I believe that there are sharply diminishing returns. Tuple
comparisons during merging are much more expensive than quicksort tuple
comparisons, which tend to benefit from abbreviated keys a lot.

As I've said, there is probably a good argument to be made for
partitioning to increase parallelism. But, that involves risks around
the partitioning being driven by statistics or a cost model, and I
don't think you'd be too on board with the idea of every CREATE INDEX
after bulk loading needing an ANALYZE first. I tend to think of that
as more of a parallel query thing, because you can often push down a
lot more there, dynamic sampling might be possible, and there isn't a
need to push all the tuples through one point in the end. Nothing I've
done here precludes your idea of a sort-order-preserving gather node.
I think that we may well need both.

Since merging is a big bottleneck with this, we should probably also
work to address that indirectly.

> The work on making the logtape infrastructure parallel-aware seems
> very interesting and potentially useful for other things.  Sadly, I
> don't have time to look at it right now.

I would be happy to look at generalizing that further, to help
parallel hash join. As you know, Thomas Munro and I have discussed
this privately.

-- 
Peter Geoghegan



Re: Parallel tuplesort (for parallel B-Tree index creation)

From
Robert Haas
Date:
On Wed, Aug 3, 2016 at 5:13 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Wed, Aug 3, 2016 at 11:42 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I'm not going to say it's bad to be able to do things 2-2.5x faster,
>> but linear scalability this ain't - particularly because your 2.58x
>> faster case is using up to 7 or 8 times as much memory.  The
>> single-process case would be faster in that case, too: you could
>> quicksort.
>
> [ lengthy counter-argument ]

None of this convinces me that testing this in a way that is not
"apples to apples" is a good idea, nor will any other argument.

>> I also think that Amdahl's law is going to pinch pretty severely here.
>
> Doesn't that almost always happen, though?

To some extent, sure, absolutely.  But it's our job as developers to
try to foresee and minimize those cases.  When Noah was at
EnterpriseDB a few years ago and we were talking about parallel
internal sort, Noah started by doing a survey of the literature and
identified parallel quicksort as the algorithm that seemed best for
our use case.  Of course, every time quicksort partitions the input,
you get two smaller sorting problems, so it's easy to see how to use 2
CPUs after the initial partitioning step has been completed and 4 CPUs
after each of those partitions has been partitioned again, and so on.
However, that turns out not to be good enough because the first
partitioning step can consume a significant percentage of the total
runtime - so if you only start parallelizing after that, you're
leaving too much on the table.  To avoid that, the algorithm he was
looking at had a (complicated) way of parallelizing the first
partitioning step; then you can, it seems, do the full sort in
parallel.

There are some somewhat outdated and perhaps naive ideas about this
that we wrote up here:

https://wiki.postgresql.org/wiki/Parallel_Sort

Anyway, you're proposing an algorithm that can't be fully
parallelized.  Maybe that's OK.  But I'm a little worried about it.
I'd feel more confident if we knew that the merge could be done in
parallel and were just leaving that to a later development stage; or
if we picked an algorithm like the one above that doesn't leave a
major chunk of the work unparallelizable.

> Isn't that what you
> generally see with queries that show off the parallel join capability?

For nested loop joins, no.  The whole join operation can be done in
parallel.  For hash joins, yes: building the hash table once per
worker can run afoul of Amdahl's law in a big way.  That's why Thomas
Munro is working on fixing it:

https://wiki.postgresql.org/wiki/EnterpriseDB_database_server_roadmap

Obviously, parallel query is subject to a long list of annoying
restrictions at this point.  On queries that don't hit any of those
restrictions we can get 4-5x speedup with a leader and 4 workers.  As
we expand the range of plan types that we can construct, I think we'll
see those kinds of speedups for a broader range of queries.  (The
question of exactly why we top out with as few workers as currently
seems to be the case needs more investigation, too; maybe contention
effects?)

>> If the final merge phase is a significant percentage of the total
>> runtime, picking an algorithm that can't parallelize the final merge
>> is going to limit the speedups to small multiples.  That's an OK place
>> to be as a result of not having done all the work yet, but you don't
>> want to get locked into it.  If we're going to have a substantial
>> portion of the work that can never be parallelized, maybe we've picked
>> the wrong algorithm.
>
> I suggest that this work be compared to something with similar
> constraints. I used Google to try to get some indication of how much
> of a difference parallel CREATE INDEX makes in other major database
> systems. This is all I could find:
>
> https://www.mssqltips.com/sqlservertip/3100/reduce-time-for-sql-server-index-rebuilds-and-update-statistics/

I do agree that it is important not to have unrealistic expectations.

> As I've said, there is probably a good argument to be made for
> partitioning to increase parallelism. But, that involves risks around
> the partitioning being driven by statistics or a cost model, and I
> don't think you'd be too on board with the idea of every CREATE INDEX
> after bulk loading needing an ANALYZE first. I tend to think of that
> as more of a parallel query thing, because you can often push down a
> lot more there, dynamic sampling might be possible, and there isn't a
> need to push all the tuples through one point in the end. Nothing I've
> done here precludes your idea of a sort-order-preserving gather node.
> I think that we may well need both.

Yes.  Rushabh is working on that, and Finalize GroupAggregate ->
Gather Merge -> Partial GroupAggregate -> Sort -> whatever is looking
pretty sweet.

>> The work on making the logtape infrastructure parallel-aware seems
>> very interesting and potentially useful for other things.  Sadly, I
>> don't have time to look at it right now.
>
> I would be happy to look at generalizing that further, to help
> parallel hash join. As you know, Thomas Munro and I have discussed
> this privately.

Right.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Fri, Aug 5, 2016 at 9:06 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> To some extent, sure, absolutely.  But it's our job as developers to
> try to foresee and minimize those cases.  When Noah was at
> EnterpriseDB a few years ago and we were talking about parallel
> internal sort, Noah started by doing a survey of the literature and
> identified parallel quicksort as the algorithm that seemed best for
> our use case.  Of course, every time quicksort partitions the input,
> you get two smaller sorting problems, so it's easy to see how to use 2
> CPUs after the initial partitioning step has been completed and 4 CPUs
> after each of those partitions has been partitioned again, and so on.
> However, that turns out not to be good enough because the first
> partitioning step can consume a significant percentage of the total
> runtime - so if you only start parallelizing after that, you're
> leaving too much on the table.  To avoid that, the algorithm he was
> looking at had a (complicated) way of parallelizing the first
> partitioning step; then you can, it seems, do the full sort in
> parallel.
>
> There are some somewhat outdated and perhaps naive ideas about this
> that we wrote up here:
>
> https://wiki.postgresql.org/wiki/Parallel_Sort

I'm familiar with that effort. I think that when researching topics
like sorting, it can sometimes be a mistake not to look beyond the
approaches specifically recommended by the database research community.
A lot of the techniques we've benefited from within tuplesort.c have
been a matter of addressing memory latency as a bottleneck; techniques
that are fairly simple and not worth writing a general interest paper
on. Also, things like abbreviated keys are beneficial in large part
because people tend to follow first normal form, and therefore an
abbreviated key can contain a fair amount of entropy most of the time.
Similarly, radix sort seems really cool, but our requirements around
generality seem to make it impractical.

> Anyway, you're proposing an algorithm that can't be fully
> parallelized.  Maybe that's OK.  But I'm a little worried about it.
> I'd feel more confident if we knew that the merge could be done in
> parallel and were just leaving that to a later development stage; or
> if we picked an algorithm like the one above that doesn't leave a
> major chunk of the work unparallelizable.

I might be able to resurrect the parallel merge stuff, just to guide
reviewer intuition on how much that can help or hurt. I can probably
repurpose it to show you the mixed picture on how effective it is. I
think it might help more with collatable text that doesn't have
abbreviated keys, for example, because you can use more of the
machine's memory bandwidth for longer. But for integers, it can hurt.
(That's my recollection; I prototyped parallel merge a couple of
months ago now.)

>> Isn't that what you
>> generally see with queries that show off the parallel join capability?
>
> For nested loop joins, no.  The whole join operation can be done in
> parallel.

Sure, I know, but I'm suggesting that laws-of-physics problems may
still be more significant than implementation deficiencies, even
though those deficiencies do need to be stamped out. Linear
scalability is really quite rare for most database workloads.

> Obviously, parallel query is subject to a long list of annoying
> restrictions at this point.  On queries that don't hit any of those
> restrictions we can get 4-5x speedup with a leader and 4 workers.  As
> we expand the range of plan types that we can construct, I think we'll
> see those kinds of speedups for a broader range of queries.  (The
> question of exactly why we top out with as few workers as currently
> seems to be the case needs more investigation, too; maybe contention
> effects?)

You're probably bottlenecked on memory bandwidth. Note that I showed
improvements with 8 workers, not 4. Four workers are slower than 8, but
not by that much.

>> https://www.mssqltips.com/sqlservertip/3100/reduce-time-for-sql-server-index-rebuilds-and-update-statistics/
>
> I do agree that it is important not to have unrealistic expectations.

Great. My ambition for this patch is to put parallel CREATE INDEX
on a competitive footing against the implementations featured in other
major systems. I don't think we need to do everything at once, but I
have no intention of pushing forward with something that doesn't do
respectably there. I also want to avoid partitioning in the first
version of this, and probably in any version that backs CREATE INDEX.
I've only made minimal changes to the tuplesort.h interface here to
support parallelism. That flexibility counts for a lot, IMV.

>> As I've said, there is probably a good argument to be made for
>> partitioning to increase parallelism. But, that involves risks around
>> the partitioning being driven by statistics or a cost model

> Yes.  Rushabh is working on that, and Finalize GroupAggregate ->
> Gather Merge -> Partial GroupAggregate -> Sort -> whatever is looking
> pretty sweet.

A "Gather Merge" node doesn't really sound like what I'm talking
about. Isn't that something to do with table-level partitioning? I'm
talking about dynamic partitioning, typically of a single table, of
course.

>>> The work on making the logtape infrastructure parallel-aware seems
>>> very interesting and potentially useful for other things.  Sadly, I
>>> don't have time to look at it right now.
>>
>> I would be happy to look at generalizing that further, to help
>> parallel hash join. As you know, Thomas Munro and I have discussed
>> this privately.
>
> Right.

By the way, the patch is in better shape from that perspective, as
compared to the early version Thomas (CC'd) had access to. The BufFile
stuff is now credible as a general-purpose abstraction.

-- 
Peter Geoghegan



Re: Parallel tuplesort (for parallel B-Tree index creation)

From
Amit Kapila
Date:
On Sat, Aug 6, 2016 at 2:16 AM, Peter Geoghegan <pg@heroku.com> wrote:
> On Fri, Aug 5, 2016 at 9:06 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> There are some somewhat outdated and perhaps naive ideas about this
>> that we wrote up here:
>>
>> https://wiki.postgresql.org/wiki/Parallel_Sort
>
> I'm familiar with that effort. I think that when researching topics
> like sorting, it can sometimes be a mistake not to look beyond the
> approaches specifically recommended by the database research
> community. A lot of the techniques we've benefited from within
> tuplesort.c have been a matter of addressing memory latency as a
> bottleneck; techniques that are fairly simple and not worth writing a
> general interest paper on. Also, things like abbreviated keys are
> beneficial in large part because people tend to follow first normal
> form, and therefore an abbreviated key can contain a fair amount of
> entropy most of the time. Similarly, radix sort seems really cool,
> but our requirements around generality seem to make it impractical.
>
>> Anyway, you're proposing an algorithm that can't be fully
>> parallelized.  Maybe that's OK.  But I'm a little worried about it.
>> I'd feel more confident if we knew that the merge could be done in
>> parallel and were just leaving that to a later development stage; or
>> if we picked an algorithm like the one above that doesn't leave a
>> major chunk of the work unparallelizable.
>
> I might be able to resurrect the parallel merge stuff, just to guide
> reviewer intuition on how much that can help or hurt.
>

I think some of the factors here, like how many workers will be used
for the merge phase, might impact the performance.  Having too many
workers can lead to more communication cost, and having too few workers
might not yield the best results for the merge.  One thing I have
noticed is that, in general for sorting, some of the other databases
use range partitioning [1]; now, that might not be what is good for us.
I see you mentioned above why it is not good [2], but I don't
understand why you think it is a risky assumption to assume good
partition boundaries for parallelizing sort.


[1] -
https://docs.oracle.com/cd/E11882_01/server.112/e25523/parallel002.htm
Refer Producer or Consumer Operations section.

[2] -
"That approach would not suit CREATE INDEX, because the approach's
great strength is that the workers can run in parallel for the entire
duration, since there is no merge bottleneck (this assumes good
partition boundaries, which is a bit of a risky assumption)"


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Sat, Aug 6, 2016 at 6:46 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> I think some of the factors here, like how many workers will be used
> for the merge phase, might impact the performance.  Having too many
> workers can lead to more communication cost, and having too few
> workers might not yield the best results for the merge.  One thing I
> have noticed is that, in general for sorting, some of the other
> databases use range partitioning [1]; now, that might not be what is
> good for us.

I don't disagree with anything you say here. I acknowledged that
partitioning will probably be important for sorting in my introductory
e-mail, after all.

> I see
> you mentioned above why it is not good [2], but I don't
> understand why you think it is a risky assumption to assume good
> partition boundaries for parallelizing sort.

Well, apparently there are numerous problems with partitioning in
systems like SQL Server and Oracle in the worst case. For one thing,
in the event of a misestimation (or failure of the dynamic sampling
that I presume can sometimes be used), workers can be completely
starved of work for the entire duration of the sort. And for CREATE
INDEX to get much of any benefit, all workers must write their part of
the index independently, too. This can affect the physical structure
of the final index. SQL Server also has a caveat in its documentation
about this resulting in an unbalanced final index, which I imagine
could be quite bad in the worst case.

I believe that it's going to be hard to get any version of this that
writes the index simultaneously in each worker accepted for these
reasons. This patch I came up with isn't very different from the
serial case at all. Any index built in parallel by the patch ought to
have relfilenode files on the filesystem that are 100% identical to
those produced by the serial case, in fact (since CREATE INDEX does
not set LSNs in the new index pages). I've actually developed a simple
way of "fingerprinting" indexes during testing of this patch, knowing
that hashing the files on disk ought to produce a perfect match
compared to a master branch serial sort case.

At the same time, any information that I've seen about how much
parallel CREATE INDEX speeds things up in these other systems
indicates that the benefits are very similar. It tends to be in the 2x
- 3x range, with the same reduction in throughput seen at about 16
workers, after we peak at about 8 workers. So, I think that the
benefits of partitioning are not really seen with CREATE INDEX (I
think of partitioning as more of a parallel query thing). Obviously,
whatever benefit might still exist for CREATE INDEX in particular makes
partitioning look pretty unattractive as a next step once it is weighed
against the costs.

I think that during the merge phase of parallel CREATE INDEX as
implemented, the system generally still isn't that far from being I/O
bound. Whereas, with parallel query, partitioning makes each worker
able to return one tuple from its own separated range very quickly,
not just one worker (presumably, each worker merges a non-overlapping
"range" of the runs initially sorted in each worker; that is, each
worker merges again after a partition-wise redistribution of the
initial fully sorted runs, with dynamic sampling available to optimize
the actual ranges used for load balancing). The workers can then do
more CPU-bound processing in whatever node is fed by each worker's
ranged merge; everything is kept busy. That's the approach that I
personally had in mind for partitioning, at least. It's really nice
for parallel query to be able to totally separate workers after the
point of redistribution. CREATE INDEX is not far from being I/O bound
anyway, though, so it benefits far less. (Consider how fast the merge
phase still is at writing out the index in *absolute* terms.)

Look at figure 9 in this paper: http://www.vldb.org/pvldb/vol7/p85-balkesen.pdf

Even in good cases for "independent sorting", a benefit is only seen at
8 cores. At the same time, I can only get about 6x scaling
with 8 workers, just for the initial generation of runs.

All of these factors are why I believe I'm able to compete well with
other systems with this relatively straightforward, evolutionary
approach. I have a completely open mind about partitioning, but my
approach makes sense in this context.

-- 
Peter Geoghegan



Re: Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Wed, Aug 3, 2016 at 2:13 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Since merging is a big bottleneck with this, we should probably also
> work to address that indirectly.

I attach a patch that changes how we maintain the heap invariant
during tuplesort merging. I already mentioned this over on the
"Parallel tuplesort, partitioning, merging, and the future" thread. As
noted already on that thread, this patch makes merging clustered
numeric input about 2.1x faster overall in one case, which is
particularly useful in the context of a serial final/leader merge
during a parallel CREATE INDEX. Even *random* non-C-collated text
input is made significantly faster. This work is totally orthogonal to
parallelism, though; it's just very timely, given our discussion of
the merge bottleneck on this thread.

If I benchmark a parallel build of a 100 million row index, with
presorted input, I can see a 71% reduction in *comparisons* with 8
tapes/workers, and an 80% reduction in comparisons with 16
workers/tapes in one instance (the numeric case I just mentioned).
With random input, we can still come out significantly ahead, but not
to the same degree. I was able to see a reduction in comparisons
during a leader merge, from 1,468,522,397 comparisons to 999,755,569
comparisons, which is obviously still quite significant (worker
merges, if any, benefit too). I think I need to redo my parallel
CREATE INDEX benchmark, so that you can take this into account. Also,
I think that this patch will make very large external sorts that
naturally have tens of runs to merge significantly faster, but I
didn't bother to benchmark that.

The patch is intended to be applied on top of parallel B-Tree patches
0001-* and 0002-* [1]. I happened to test it with parallelism, but
these are all independently useful, and will be entered as a separate
CF entry (perhaps better to commit the earlier two patches first, to
avoid merge conflicts). I'm optimistic that we can get those 3 patches
in the series out of the way early, without blocking on discussing
parallel sort.

The patch makes tuplesort merging shift down and displace the root
tuple with the tape's next preread tuple, rather than compacting and
then inserting into the heap anew. This approach to maintaining the
heap as tuples are returned to caller will always produce fewer
comparisons overall. The new approach is also simpler. We were already
shifting down to compact the heap within the misleadingly named [2]
function tuplesort_heap_siftup() -- why not instead just use the
caller tuple (the tuple that we currently go on to insert) when
initially shifting down (not the heap's preexisting last tuple, which
is guaranteed to go straight to the leaf level again)? That way, we
don't need to enlarge the heap at all through insertion, shifting up,
etc. We're done, and are *guaranteed* to have performed less work
(fewer comparisons and swaps) than with the existing approach (this is
the reason for my optimism about getting this stuff out of the way
early).
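
In outline, the displace operation is just the textbook shift-down,
with the caller's tuple filling the hole (a simplified sketch, not the
patch's code verbatim; cmp() stands in for tuplesort's COMPARETUP()):

/* Sketch: replace the root of min-heap a[0..n-1] with *newtup,
 * then shift down to restore the heap invariant */
static void
heap_root_displace(SortTuple *a, int n, SortTuple *newtup)
{
    int     i = 0;                  /* the "hole", starting at root */

    for (;;)
    {
        int     j = 2 * i + 1;      /* left child */

        if (j >= n)
            break;
        if (j + 1 < n && cmp(&a[j + 1], &a[j]) < 0)
            j++;                    /* right child is smaller */
        if (cmp(newtup, &a[j]) <= 0)
            break;                  /* newtup belongs in the hole */
        a[i] = a[j];                /* pull smaller child up */
        i = j;                      /* hole shifts down */
    }
    a[i] = *newtup;                 /* exactly one final assignment */
}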

This new approach is more or less the *conventional* way to maintain
the heap invariant when returning elements from a heap during k-way
merging. Our existing approach is convoluted; merging was presumably
only coded that way because the generic functions
tuplesort_heap_siftup() and tuplesort_heap_insert() happened to be
available. Perhaps the problem was masked by unrelated bottlenecks
that existed at the time, too.

I think that I could push this further (a minimum of 2 comparisons per
item returned when 3 or more tapes are active still seems like 1
comparison too many), but what I have here gets us most of the
benefit. And, it does so while not actually adding code that could be
called "overly clever", IMV. I'll probably leave clever, aggressive
optimization of merging for a later release.

[1] https://www.postgresql.org/message-id/CAM3SWZQKM=Pzc=CAHzRixKjp2eO5Q0Jg1SoFQqeXFQ647JiwqQ@mail.gmail.com
[2] https://www.postgresql.org/message-id/CAM3SWZQ+2gJMNV7ChxwEXqXopLfb_FEW2RfEXHJ+GsYF39f6MQ@mail.gmail.com
--
Peter Geoghegan


Re: Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Mon, Aug 1, 2016 at 3:18 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Attached WIP patch series:

This has bitrot, since commit da1c9163 changed the interface for
checking parallel safety. I'll have to fix that, and will probably
take the opportunity to change how workers have maintenance_work_mem
apportioned while I'm at it. To recap, it would probably be better if
maintenance_work_mem remained a high watermark for the entire CREATE
INDEX, rather than applying as a per-worker allowance.


-- 
Peter Geoghegan



Re: Parallel tuplesort (for parallel B-Tree index creation)

From
Heikki Linnakangas
Date:
On 08/16/2016 03:33 AM, Peter Geoghegan wrote:
> I attach a patch that changes how we maintain the heap invariant
> during tuplesort merging. I already mentioned this over on the
> "Parallel tuplesort, partitioning, merging, and the future" thread. As
> noted already on that thread, this patch makes merging clustered
> numeric input about 2.1x faster overall in one case, which is
> particularly useful in the context of a serial final/leader merge
> during a parallel CREATE INDEX. Even *random* non-C-collated text
> input is made significantly faster. This work is totally orthogonal to
> parallelism, though; it's just very timely, given our discussion of
> the merge bottleneck on this thread.

Nice!

> The patch makes tuplesort merging shift down and displace the root
> tuple with the tape's next preread tuple, rather than compacting and
> then inserting into the heap anew. This approach to maintaining the
> heap as tuples are returned to caller will always produce fewer
> comparisons overall. The new approach is also simpler. We were already
> shifting down to compact the heap within the misleadingly named [2]
> function tuplesort_heap_siftup() -- why not instead just use the
> caller tuple (the tuple that we currently go on to insert) when
> initially shifting down (not the heap's preexisting last tuple, which
> is guaranteed to go straight to the leaf level again)? That way, we
> don't need to enlarge the heap at all through insertion, shifting up,
> etc. We're done, and are *guaranteed* to have performed less work
> (fewer comparisons and swaps) than with the existing approach (this is
> the reason for my optimism about getting this stuff out of the way
> early).

Makes sense.

> This new approach is more or less the *conventional* way to maintain
> the heap invariant when returning elements from a heap during k-way
> merging. Our existing approach is convoluted; merging was presumably
> only coded that way because the generic functions
> tuplesort_heap_siftup() and tuplesort_heap_insert() happened to be
> available. Perhaps the problem was masked by unrelated bottlenecks
> that existed at the time, too.

Yeah, this seems like a very obvious optimization. Is there a standard 
name for this technique in the literature? I'm OK with "displace", or 
perhaps just "replace" or "siftup+insert", but if there's a standard 
name for this, let's use that.

- Heikki




Re: Parallel tuplesort (for parallel B-Tree index creation)

From
Heikki Linnakangas
Date:
I'm reviewing patches 1-3 in this series, i.e. those patches that are 
not directly related to parallelism, but are independent improvements to 
merging.

Let's begin with patch 1:

On 08/02/2016 01:18 AM, Peter Geoghegan wrote:
> Cap the number of tapes used by external sorts
>
> Commit df700e6b set merge order based on available buffer space (the
> number of tapes was as high as possible while still allowing at least 32
> * BLCKSZ buffer space per tape), rejecting Knuth's theoretically
> justified "sweet spot" of 7 tapes (a merge order of 6 -- Knuth's P),
> improving performance when the sort thereby completed in one pass.
> However, it's still true that there are unlikely to be benefits from
> increasing the number of tapes past 7 once the amount of data to be
> sorted significantly exceeds available memory; that commit probably
> mostly just improved matters where it enabled all merging to be done in
> a final on-the-fly merge.
>
> One problem with the merge order logic established by that commit is
> that with large work_mem settings and data volumes, the tapes previously
> wasted as much as 8% of the available memory budget; tens of thousands
> of tapes could be logically allocated for a sort that will only benefit
> from a few dozen.

Yeah, wasting 8% of the memory budget on this seems like a bad idea. If 
I understand correctly, that makes the runs shorter than necessary, 
leading to more runs.

> A new quasi-arbitrary cap of 501 is applied on the number of tapes that
> tuplesort will ever use (i.e.  merge order is capped at 500 inclusive).
> This is a conservative estimate of the number of runs at which doing all
> merging on-the-fly no longer allows greater overlapping of I/O and
> computation.

Hmm. Surely there are cases where, with > 501 tapes, you could do it in
one merge pass, but now you need two? And that would hurt performance,
no?

Why do we reserve the buffer space for all the tapes right at the
beginning? Instead of the single USEMEM(maxTapes * TAPE_BUFFER_OVERHEAD)
call in inittapes(), couldn't we call USEMEM(TAPE_BUFFER_OVERHEAD) every
time we start a new run, until we reach maxTapes?
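
I.e. something like this (untested sketch):

/* Untested sketch: charge for a tape's buffer lazily, when each new
 * run first needs a tape, instead of all at once in inittapes() */
if (state->currentRun < state->maxTapes)
    USEMEM(state, TAPE_BUFFER_OVERHEAD);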

- Heikki




Re: Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Tue, Sep 6, 2016 at 12:08 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> I attach a patch that changes how we maintain the heap invariant
>> during tuplesort merging.

> Nice!

Thanks!

>> This new approach is more or less the *conventional* way to maintain
>> the heap invariant when returning elements from a heap during k-way
>> merging. Our existing approach is convoluted; merging was presumably
>> only coded that way because the generic functions
>> tuplesort_heap_siftup() and tuplesort_heap_insert() happened to be
>> available. Perhaps the problem was masked by unrelated bottlenecks
>> that existed at the time, too.
>
>
> Yeah, this seems like a very obvious optimization. Is there a standard name
> for this technique in the literature? I'm OK with "displace", or perhaps
> just "replace" or "siftup+insert", but if there's a standard name for this,
> let's use that.

I used the term "displace" specifically because it wasn't a term with
a well-defined meaning in the context of the analysis of algorithms.
Just like "insert" isn't for tuplesort_heap_insert(). I'm not
particularly attached to the name tuplesort_heap_root_displace(), but
I do think that whatever it ends up being called should at least not
be named after an implementation detail. For example,
tuplesort_heap_root_replace() also seems fine.

I think that tuplesort_heap_siftup() should be called something like
tuplesort_heap_compact instead [1], since what it actually does
(shifting down -- the existing name is completely backwards!) is just
an implementation detail involved in compacting the heap (notice that
it decrements memtupcount, which, by now, means the k-way merge heap
gets one element smaller). I can write a patch to do this renaming, if
you're interested. Someone should fix it, because independent of all
this, it's just wrong.

[1] https://www.postgresql.org/message-id/CAM3SWZQKM=Pzc=CAHzRixKjp2eO5Q0Jg1SoFQqeXFQ647JiwqQ@mail.gmail.com
-- 
Peter Geoghegan



Re: Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Tue, Sep 6, 2016 at 12:39 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Tue, Sep 6, 2016 at 12:08 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>> I attach a patch that changes how we maintain the heap invariant
>>> during tuplesort merging.
>
>> Nice!
>
> Thanks!

BTW, the way that k-way merging is made more efficient by this
approach makes the case for replacement selection even weaker than it
was just before we almost killed it. I hate to say it, but I have to
wonder if we shouldn't get rid of the new-to-9.6
replacement_sort_tuples because of this, and completely kill
replacement selection. I'm not going to go on about it, but that seems
sensible to me.


-- 
Peter Geoghegan



Re: Parallel tuplesort (for parallel B-Tree index creation)

From
Claudio Freire
Date:
On Mon, Aug 15, 2016 at 9:33 PM, Peter Geoghegan <pg@heroku.com> wrote:
> The patch is intended to be applied on top of parallel B-Tree patches
> 0001-* and 0002-* [1]. I happened to test it with parallelism, but
> these are all independently useful, and will be entered as a separate
> CF entry (perhaps better to commit the earlier two patches first, to
> avoid merge conflicts). I'm optimistic that we can get those 3 patches
> in the series out of the way early, without blocking on discussing
> parallel sort.

Applied patches 1 and 2, builds fine, regression tests run fine. It
was a prerequisite to reviewing patch 3 (which I'm going to do below),
so I thought I might as well report on that tidbit of info, fwiw.

> The patch makes tuplesort merging shift down and displace the root
> tuple with the tape's next preread tuple, rather than compacting and
> then inserting into the heap anew. This approach to maintaining the
> heap as tuples are returned to caller will always produce fewer
> comparisons overall. The new approach is also simpler. We were already
> shifting down to compact the heap within the misleadingly named [2]
> function tuplesort_heap_siftup() -- why not instead just use the
> caller tuple (the tuple that we currently go on to insert) when
> initially shifting down (not the heap's preexisting last tuple, which
> is guaranteed to go straight to the leaf level again)? That way, we
> don't need to enlarge the heap at all through insertion, shifting up,
> etc. We're done, and are *guaranteed* to have performed less work
> (fewer comparisons and swaps) than with the existing approach (this is
> the reason for my optimism about getting this stuff out of the way
> early).
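
To restate the quoted idea in code form, the core of the displace
operation is more or less the following simplified sketch (my own
paraphrase, not the patch's actual code -- it ignores tupindex
checking and all the surrounding bookkeeping):

    static void
    heap_root_displace_sketch(Tuplesortstate *state, SortTuple *newtup)
    {
        SortTuple  *memtuples = state->memtuples;
        int         n = state->memtupcount; /* heap size, incl. old root */
        int         i = 0;                  /* hole starts at the root */

        for (;;)
        {
            int         j = 2 * i + 1;      /* left child */

            if (j >= n)
                break;
            /* pick the smaller child, if there are two */
            if (j + 1 < n &&
                COMPARETUP(state, &memtuples[j + 1], &memtuples[j]) < 0)
                j++;
            /* stop once the new tuple sorts no later than both children */
            if (COMPARETUP(state, newtup, &memtuples[j]) <= 0)
                break;
            memtuples[i] = memtuples[j];    /* pull child up into the hole */
            i = j;
        }
        memtuples[i] = *newtup;             /* new tuple fills the hole */
    }

The caller's tuple fills the hole as it sifts down; there is no
separate compact step, and no re-insertion from the leaf level.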

Patch 3 applies fine to git master as of
25794e841e5b86a0f90fac7f7f851e5d950e51e2 (on top of patches 1 and 2).

Builds fine and without warnings on gcc 4.8.5 AFAICT, regression test
suite runs without issues as well.

Patch lacks any new tests, but the changed code paths seem covered
sufficiently by existing tests. A little bit of fuzzing on the patch
itself, like reverting some key changes, or flipping some key
comparisons, induces test failures as it should, mostly in cluster.

The logic in tuplesort_heap_root_displace seems sound, except:

+                */
+               memtuples[i] = memtuples[imin];
+               i = imin;
+       }
+
+       Assert(state->memtupcount > 1 || imin == 0);
+       memtuples[imin] = *newtup;
+}

Why that assert? Wouldn't it make more sense to Assert(imin < n) ?


In the meanwhile, I'll go and do some perf testing.

Assuming the speedup is realized during testing, LGTM.



Re: Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Tue, Sep 6, 2016 at 12:34 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> I'm reviewing patches 1-3 in this series, i.e. those patches that are not
> directly related to parallelism, but are independent improvements to
> merging.

That's fantastic! Thanks!

I'm really glad you're picking those ones up. I feel that I'm far too
dependent on Robert's review for this stuff. That shouldn't be taken
as a statement against Robert -- it's intended as quite the opposite
-- but it's just personally difficult to rely on exactly one other
person for something that I've put so much work into. Robert has been
involved with 100% of all sorting patches I've written, generally with
far less input from anyone else, and at this point, that's really
rather a lot of complex patches.

> Let's begin with patch 1:
>
> On 08/02/2016 01:18 AM, Peter Geoghegan wrote:
>>
>> Cap the number of tapes used by external sorts

> Yeah, wasting 8% of the memory budget on this seems like a bad idea. If I
> understand correctly, that makes the runs shorter than necessary, leading to
> more runs.

Right. Quite simply, whatever you could have used the workMem for
prior to the merge step, now you can't. It's not so bad during the
merge step of a final on-the-fly merge (or, with the 0002-* patch, any
final merge), since you can get a "refund" of unused (though logically
allocated by USEMEM()) tapes to grow memtuples with (other overhead
forms the majority of the refund, though). That still isn't much
consolation to the user, because run generation is typically much more
expensive (we really just refund unused tapes because it's easy).

>> A new quasi-arbitrary cap of 501 is applied on the number of tapes that
>> tuplesort will ever use (i.e.  merge order is capped at 500 inclusive).
>> This is a conservative estimate of the number of runs at which doing all
>> merging on-the-fly no longer allows greater overlapping of I/O and
>> computation.
>
>
> Hmm. Surely there are cases where, with > 501 tapes, you could do it with
> one merge pass, but now you need two? And that would hurt performance, no?

In theory, yes, that could be true, and not just for my proposed new
cap of 500 for merge order (501 tapes), but for any such cap. I
noticed that the Greenplum tuplesort.c uses a max of 250, so I guess I
just thought to double that. Way back in 2006, Tom and Simon 
talked about a cap too on several occasions, but I think that that was
in the thousands then.

Hundreds of runs are typically quite rare. It isn't that painful to do
a second pass, because the merge process may be more CPU cache
efficient as a result, which tends to be the dominant cost these days
(over and above the extra I/O that an extra pass requires).
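
To put rough numbers on that: the merge passes needed come out to
about ceil(log(num_runs) / log(merge_order)), so with the proposed cap
(merge order 500):

    up to 500 runs             -> 1 merge pass
    up to 500^2 = 250,000 runs -> 2 merge passes

A second pass only starts to be needed beyond 500 runs, and a third
would take a quarter of a million runs.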

This seems like a very familiar situation to me: I pick a
quasi-arbitrary limit or cap for something, and it's not clear that
it's optimal. Everyone more or less recognizes the need for such a
cap, but is uncomfortable about the exact figure chosen, not because
it's objectively bad, but because it's clearly something pulled from
the air, to some degree. It may not make you feel much better about
it, but I should point out that I've read a paper that claims "Modern
servers of the day have hundreds of GB operating memory and tens of TB
storage capacity. Hence, if the sorted data fit the persistent
storage, the first phase will generate hundreds of runs at most." [1].

Feel free to make a counter-proposal for a cap. I'm not attached to
500. I'm mostly worried about blatant waste with very large workMem
sizings. Tens of thousands of tapes is just crazy. The amount of data
that you need to have as input is very large when workMem is big
enough for this new cap to be enforced.

> Why do we reserve the buffer space for all the tapes right at the beginning?
> Instead of the single USEMEM(maxTapes * TAPE_BUFFER_OVERHEAD) call in
> inittapes(), couldn't we call USEMEM(TAPE_BUFFER_OVERHEAD) every time we
> start a new run, until we reach maxTapes?

No, because then you have no way to clamp back memory, which is now
almost all used (we hold off from making LACKMEM() continually true,
if at all possible, which is almost always the case). You can't really
continually shrink memtuples to make space for new tapes, which is
what it would take.

[1] http://ceur-ws.org/Vol-1343/paper8.pdf
-- 
Peter Geoghegan



Re: Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Tue, Sep 6, 2016 at 2:46 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Feel free to make a counter-proposal for a cap. I'm not attached to
> 500. I'm mostly worried about blatant waste with very large workMem
> sizings. Tens of thousands of tapes is just crazy. The amount of data
> that you need to have as input is very large when workMem is big
> enough for this new cap to be enforced.

If tuplesort callers passed a hint about the number of tuples that
would ultimately be sorted, and (for the sake of argument) it was
magically 100% accurate, then theoretically we could just allocate the
right number of tapes up-front. That discussion is a big can of worms,
though. There are of course obvious disadvantages that come with a
localized cost model, even if you're prepared to add some "slop" to
the allocation size or whatever.

-- 
Peter Geoghegan



Re: Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Tue, Sep 6, 2016 at 12:57 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> Patch lacks any new tests, but the changed code paths seem covered
> sufficiently by existing tests. A little bit of fuzzing on the patch
> itself, like reverting some key changes, or flipping some key
> comparisons, induces test failures as it should, mostly in cluster.
>
> The logic in tuplesort_heap_root_displace seems sound, except:
>
> +                */
> +               memtuples[i] = memtuples[imin];
> +               i = imin;
> +       }
> +
> +       Assert(state->memtupcount > 1 || imin == 0);
> +       memtuples[imin] = *newtup;
> +}
>
> Why that assert? Wouldn't it make more sense to Assert(imin < n) ?

There might only be one or two elements in the heap. Note that the
heap size is indicated by state->memtupcount at this point in the
sort, which is a little confusing (that differs from how memtupcount
is used elsewhere, where we don't partition memtuples into a heap
portion and a preread tuples portion, as we do here).
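
Roughly, the layout during merging is as follows (a sketch, from
memory, of the existing scheme):

    memtuples[0 .. memtupcount-1]           the merge heap itself
    memtuples[memtupcount .. memtupsize-1]  preread tuple slots,
                                            grouped into per-tape lists

whereas during initial run generation, memtupcount counts every tuple
held in memory.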

> In the meanwhile, I'll go and do some perf testing.
>
> Assuming the speedup is realized during testing, LGTM.

Thanks. I suggest spending at least as much time on unsympathetic
cases (e.g., only 2 or 3 tapes must be merged). At the same time, I
suggest focusing on a type that has relatively expensive comparisons,
such as collated text, to make differences clearer.

-- 
Peter Geoghegan



Re: Parallel tuplesort (for parallel B-Tree index creation)

From
Claudio Freire
Date:
On Tue, Sep 6, 2016 at 8:28 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Tue, Sep 6, 2016 at 12:57 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> Patch lacks any new tests, but the changed code paths seem covered
>> sufficiently by existing tests. A little bit of fuzzing on the patch
>> itself, like reverting some key changes, or flipping some key
>> comparisons, induces test failures as it should, mostly in cluster.
>>
>> The logic in tuplesort_heap_root_displace seems sound, except:
>>
>> +                */
>> +               memtuples[i] = memtuples[imin];
>> +               i = imin;
>> +       }
>> +
>> +       Assert(state->memtupcount > 1 || imin == 0);
>> +       memtuples[imin] = *newtup;
>> +}
>>
>> Why that assert? Wouldn't it make more sense to Assert(imin < n) ?
>
> There might only be one or two elements in the heap. Note that the
> heap size is indicated by state->memtupcount at this point in the
> sort, which is a little confusing (that differs from how memtupcount
> is used elsewhere, where we don't partition memtuples into a heap
> portion and a preread tuples portion, as we do here).

I noticed, but here n = state->memtupcount

+       Assert(memtuples[0].tupindex == newtup->tupindex);
+
+       CHECK_FOR_INTERRUPTS();
+
+       n = state->memtupcount;    /* n is heap's size, including old root */
+       imin = 0;                  /* start with caller's "hole" in root */
+       i = imin;

In fact, the assert on the patch would allow writing memtuples outside
the heap, as in calling tuplesort_heap_root_displace if
memtupcount==0, but I don't think that should be legal (memtuples[0]
== memtuples[imin] would be outside the heap).

Sure, that's a weird enough case (that assert up there already reads
memtuples[0] which would be equally illegal if memtupcount==0), but it
goes to show that the assert expression just seems odd for its
intent.

BTW, I know it's not the scope of the patch, but shouldn't
root_displace be usable on the TSS_BOUNDED phase?



Re: Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Tue, Sep 6, 2016 at 4:55 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> I noticed, but here n = state->memtupcount
>
> +       Assert(memtuples[0].tupindex == newtup->tupindex);
> +
> +       CHECK_FOR_INTERRUPTS();
> +
> +       n = state->memtupcount;    /* n is heap's size, including old root */
> +       imin = 0;                  /* start with caller's "hole" in root */
> +       i = imin;

I'm fine with using "n" in the later assertion you mentioned, if
that's clearer to you. memtupcount is broken out as "n" simply because
that's less verbose, in a place where that makes things far clearer.

> In fact, the assert on the patch would allow writing memtuples outside
> the heap, as in calling tuplesort_heap_root_displace if
> memtupcount==0, but I don't think that should be legal (memtuples[0]
> == memtuples[imin] would be outside the heap).

You have to have a valid heap (i.e. there must be at least one
element) to call tuplesort_heap_root_displace(), and it doesn't
directly compact the heap, so it must remain valid on return. The
assertion exists to make sure that everything is okay with a
one-element heap, a case which is quite possible. If you want to see a
merge involving one input tape, apply the entire parallel CREATE INDEX
patch set, set "force_parallel_mode = regress", and note that the
leader merges only 1 input tape, making the heap only ever
contain one element. In general, most use of the heap for k-way
merging will eventually end up as a one element heap, at the very end.

Maybe that assertion you mention is overkill, but I like to err on the
side of overkill with assertions. It doesn't seem that important,
though.

> Sure, that's a weird enough case (that assert up there already reads
> memtuples[0] which would be equally illegal if memtupcount==0), but it
> goes to show that the assert expression just seems odd for its
> intent.
>
> BTW, I know it's not the scope of the patch, but shouldn't
> root_displace be usable on the TSS_BOUNDED phase?

I don't think it should be, no. With a top-n heap sort, the
expectation is that after a little while, we can immediately determine
that most tuples do not belong in the heap (this will require more
than one comparison per tuple when the tuple that may be entered into
the heap will in fact go in the heap, which should be fairly rare
after a time). That's why that general strategy can be so much faster,
of course.

Note that that heap is "reversed" -- the sort order is inverted, so
that we can use a minheap. The top of the heap is the most marginal
tuple in the top-n heap so far, and so is the next to be removed from
consideration entirely (not the next to be returned to caller, when
merging).

Anyway, I just don't think that this is important enough to change --
it couldn't possibly be worth much of any risk. I can see the appeal
of consistency, but I also see the appeal of sticking to how things
work there: continually and explicitly inserting into and compacting
the heap seems like a good enough way of framing what a top-n heap
does, since there are no groupings of tuples (tapes) involved there.
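
To spell that out, the TSS_BOUNDED input path currently looks roughly
like this (paraphrasing puttuple_common(); not a verbatim excerpt):

    if (COMPARETUP(state, tuple, &state->memtuples[0]) <= 0)
    {
        /*
         * Under the reversed comparator, sorting at-or-before the root
         * means the new tuple is no better than the most marginal tuple
         * in the top-n heap: discard it after this single comparison,
         * which is the common case once the heap has warmed up.
         */
        free_sort_tuple(state, tuple);
    }
    else
    {
        /* evict the most marginal tuple, then insert the new one */
        free_sort_tuple(state, &state->memtuples[0]);
        tuplesort_heap_siftup(state, false);
        tuplesort_heap_insert(state, tuple, 0, false);
    }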

-- 
Peter Geoghegan



Re: Parallel tuplesort (for parallel B-Tree index creation)

From
Claudio Freire
Date:
On Tue, Sep 6, 2016 at 9:19 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Tue, Sep 6, 2016 at 4:55 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> I noticed, but here n = state->memtupcount
>>
>> +       Assert(memtuples[0].tupindex == newtup->tupindex);
>> +
>> +       CHECK_FOR_INTERRUPTS();
>> +
>> +       n = state->memtupcount;    /* n is heap's size, including old root */
>> +       imin = 0;                  /* start with caller's "hole" in root */
>> +       i = imin;
>
> I'm fine with using "n" in the later assertion you mentioned, if
> that's clearer to you. memtupcount is broken out as "n" simply because
> that's less verbose, in a place where that makes things far clearer.
>
>> In fact, the assert on the patch would allow writing memtuples outside
>> the heap, as in calling tuplesort_heap_root_displace if
>> memtupcount==0, but I don't think that should be legal (memtuples[0]
>> == memtuples[imin] would be outside the heap).
>
> You have to have a valid heap (i.e. there must be at least one
> element) to call tuplesort_heap_root_displace(), and it doesn't
> directly compact the heap, so it must remain valid on return. The
> assertion exists to make sure that everything is okay with a
> one-element heap, a case which is quite possible.

More than using "n" or "memtupcount" what I'm saying is to assert that
memtuples[imin] is inside the heap, which would catch the same errors
the original assert would, and more.

Assert(imin < state->memtupcount)

If you prefer.

The original assert allows any value of imin for memtupcount>1, and
that's my main concern. It shouldn't.


On Tue, Sep 6, 2016 at 9:19 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> Sure, that's a weird enough case (that assert up there already reads
>> memtuples[0] which would be equally illegal if memtupcount==0), but it
>> goes to show that the assert expression just seems odd for its
>> intent.
>>
>> BTW, I know it's not the scope of the patch, but shouldn't
>> root_displace be usable on the TSS_BOUNDED phase?
>
> I don't think it should be, no. With a top-n heap sort, the
> expectation is that after a little while, we can immediately determine
> that most tuples do not belong in the heap (this will require more
> than one comparison per tuple when the tuple that may be entered into
> the heap will in fact go in the heap, which should be fairly rare
> after a time). That's why that general strategy can be so much faster,
> of course.

I wasn't proposing getting rid of that optimization, but just
replacing the siftup+insert step with root_displace...

> Note that that heap is "reversed" -- the sort order is inverted, so
> that we can use a minheap. The top of the heap is the most marginal
> tuple in the top-n heap so far, and so is the next to be removed from
> consideration entirely (not the next to be returned to caller, when
> merging).

...but I didn't pause to consider that point.

It still looks like a valid optimization, instead rearranging the heap
twice (siftup + insert), do it once (replace + relocate).

However, I agree that it's not worth the risk conflating the two
optimizations. That one can be done later as a separate patch.



Re: Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Tue, Sep 6, 2016 at 5:50 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> However, I agree that it's not worth the risk conflating the two
> optimizations. That one can be done later as a separate patch.

I'm rather fond of the assertions about tape number that exist within
root_displace currently. But, yeah, maybe.


-- 
Peter Geoghegan



Re: Parallel tuplesort (for parallel B-Tree index creation)

From
Heikki Linnakangas
Date:
On 09/06/2016 10:42 PM, Peter Geoghegan wrote:
> On Tue, Sep 6, 2016 at 12:39 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> On Tue, Sep 6, 2016 at 12:08 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>>> I attach a patch that changes how we maintain the heap invariant
>>>> during tuplesort merging.
>>
>>> Nice!
>>
>> Thanks!
>
> BTW, the way that k-way merging is made more efficient by this
> approach makes the case for replacement selection even weaker than it
> was just before we almost killed it.

This also makes the replacement selection cheaper, no?

> I hate to say it, but I have to
> wonder if we shouldn't get rid of the new-to-9.6
> replacement_sort_tuples because of this, and completely kill
> replacement selection. I'm not going to go on about it, but that seems
> sensible to me.

Yeah, perhaps. But that's a different story.

- Heikki




Re: Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Tue, Sep 6, 2016 at 10:28 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> BTW, the way that k-way merging is made more efficient by this
>> approach makes the case for replacement selection even weaker than it
>> was just before we almost killed it.
>
>
> This also makes the replacement selection cheaper, no?

Well, maybe, but the whole idea behind replacement_sort_tuples (by
which I mean the continued occasional use of replacement selection by
Postgres) was that we hope to avoid a merge step *entirely*. This new
merge shift down heap patch could make the merge step so cheap as to
be next to free anyway (in the event of presorted input), so the
argument for replacement_sort_tuples is weakened further. It might
always be cheaper once you factor in that the TSS_SORTEDONTAPE path
for returning tuples to caller happens to not be able to use batch
memory, even with something like collated text. And, as a bonus, you
get something that works just as well with an inverse correlation,
which was traditionally the worst case for replacement selection (it
makes it produce runs no larger than those produced by quicksort).

Anyway, I only mention this because it occurs to me. I have no desire
to go back to talking about replacement selection either. Maybe it's
useful to point this out, because it makes it clearer still that
severely limiting the use of replacement selection in 9.6 was totally
justified.

-- 
Peter Geoghegan



Re: Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Tue, Sep 6, 2016 at 10:36 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Well, maybe, but the whole idea behind replacement_sort_tuples (by
> which I mean the continued occasional use of replacement selection by
> Postgres) was that we hope to avoid a merge step *entirely*. This new
> merge shift down heap patch could make the merge step so cheap as to
> be next to free anyway (in the event of presorted input)

I mean: Cheaper than just processing the tuples to return to caller
without comparisons/merging (within the TSS_SORTEDONTAPE path). I do
not mean free in an absolute sense, of course.

-- 
Peter Geoghegan



Re: Parallel tuplesort (for parallel B-Tree index creation)

From
Heikki Linnakangas
Date:
On 09/07/2016 12:46 AM, Peter Geoghegan wrote:
> On Tue, Sep 6, 2016 at 12:34 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> Why do we reserve the buffer space for all the tapes right at the beginning?
>> Instead of the single USEMEM(maxTapes * TAPE_BUFFER_OVERHEAD) callin
>> inittapes(), couldn't we call USEMEM(TAPE_BUFFER_OVERHEAD) every time we
>> start a new run, until we reach maxTapes?
>
> No, because then you have no way to clamp back memory, which is now
> almost all used (we hold off from making LACKMEM() continually true,
> if at all possible, which is almost always the case). You can't really
> continually shrink memtuples to make space for new tapes, which is
> what it would take.

I still don't get it. When building the initial runs, we don't need 
buffer space for maxTapes yet, because we're only writing to a single 
tape at a time. An unused tape shouldn't take much memory. In 
inittapes(), when we have built all the runs, we know how many tapes we 
actually needed, and we can allocate the buffer memory accordingly.

[thinks a bit, looks at logtape.c]. Hmm, I guess that's wrong, because 
of the way this all is implemented. When we're building the initial 
runs, we're only writing to one tape at a time, but logtape.c 
nevertheless holds onto a BLCKSZ'd currentBuffer, plus one buffer for 
each indirect level, for every tape that has been used so far. What if 
we changed LogicalTapeRewind to free those buffers? Flush out the 
indirect buffers to disk, remembering just the physical block number of 
the topmost indirect block in memory, and free currentBuffer. That way, 
a tape that has been used, but isn't being read or written to at the 
moment, would take very little memory, and we wouldn't need to reserve 
space for them in the build-runs phase.

- Heikki




Re: Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Tue, Sep 6, 2016 at 10:51 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> I still don't get it. When building the initial runs, we don't need buffer
> space for maxTapes yet, because we're only writing to a single tape at a
> time. An unused tape shouldn't take much memory. In inittapes(), when we
> have built all the runs, we know how many tapes we actually needed, and we
> can allocate the buffer memory accordingly.

Right. That's correct. But, we're not concerned about physically
allocated memory, but rather logically allocated memory (i.e., what
goes into USEMEM()). tuplesort.c should be able to fully use the
workMem specified by caller in the event of an external sort, just as
with an internal sort.

> [thinks a bit, looks at logtape.c]. Hmm, I guess that's wrong, because of
> the way this all is implemented. When we're building the initial runs, we're
> only writing to one tape at a time, but logtape.c nevertheless holds onto a
> BLCKSZ'd currentBuffer, plus one buffer for each indirect level, for every
> tape that has been used so far. What if we changed LogicalTapeRewind to free
> those buffers?

There isn't much point in that, because those buffers are never
physically allocated in the first place when there are thousands. They
are, however, entered into the tuplesort.c accounting as if they were,
denying tuplesort.c the full benefit of available workMem. It doesn't
matter if you USEMEM() or FREEMEM() after we first spill to disk, but
before we begin the merge. (We already refund the
unused-but-logically-allocated memory from unused tapes at the beginning of
the merge (within beginmerge()), so we can't do any better than we
already are from that point on -- that makes the batch memtuples
growth thing slightly more effective.)

-- 
Peter Geoghegan



Re: Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Tue, Sep 6, 2016 at 10:57 PM, Peter Geoghegan <pg@heroku.com> wrote:
> There isn't much point in that, because those buffers are never
> physically allocated in the first place when there are thousands. They
> are, however, entered into the tuplesort.c accounting as if they were,
> denying tuplesort.c the full benefit of available workMem. It doesn't
> matter if you USEMEM() or FREEMEM() after we first spill to disk, but
> before we begin the merge. (We already refund the
> unused-but-logically-allocated memory from unused tapes at the beginning of
> the merge (within beginmerge()), so we can't do any better than we
> already are from that point on -- that makes the batch memtuples
> growth thing slightly more effective.)

The big picture here is that you can't only USEMEM() for tapes as the
need arises for new tapes as new runs are created. You'll just run a
massive availMem deficit, that you have no way of paying back, because
you can't "liquidate assets to pay off your creditors" (e.g., release
a bit of the memtuples memory). The fact is that memtuples growth
doesn't work that way. The memtuples array never shrinks.
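
To be clear about terms: "logically allocated" just means accounted
for against availMem. The accounting is only bookkeeping -- roughly,
tuplesort.c has:

    #define FREEMEM(state,amt)  ((state)->availMem += (amt))
    #define USEMEM(state,amt)   ((state)->availMem -= (amt))
    #define LACKMEM(state)      ((state)->availMem < 0)

so a USEMEM() with no palloc() behind it still shrinks the budget that
everything else, memtuples included, must live within.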


-- 
Peter Geoghegan



Re: Parallel tuplesort (for parallel B-Tree index creation)

From
Heikki Linnakangas
Date:
On 09/07/2016 09:01 AM, Peter Geoghegan wrote:
> On Tue, Sep 6, 2016 at 10:57 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> There isn't much point in that, because those buffers are never
>> physically allocated in the first place when there are thousands. They
>> are, however, entered into the tuplesort.c accounting as if they were,
>> denying tuplesort.c the full benefit of available workMem. It doesn't
>> matter if you USEMEM() or FREEMEM() after we first spill to disk, but
>> before we begin the merge. (We already refund the
>> unused-but-logically-allocated memory from unused tapes at the beginning of
>> the merge (within beginmerge()), so we can't do any better than we
>> already are from that point on -- that makes the batch memtuples
>> growth thing slightly more effective.)
>
> The big picture here is that you can't only USEMEM() for tapes as the
> need arises for new tapes as new runs are created. You'll just run a
> massive availMem deficit, that you have no way of paying back, because
> you can't "liquidate assets to pay off your creditors" (e.g., release
> a bit of the memtuples memory). The fact is that memtuples growth
> doesn't work that way. The memtuples array never shrinks.

Hmm. But memtuples is empty, just after we have built the initial runs. 
Why couldn't we shrink, i.e. free and reallocate, it?

- Heikki




Re: Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Tue, Sep 6, 2016 at 11:09 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> The big picture here is that you can't only USEMEM() for tapes as the
>> need arises for new tapes as new runs are created. You'll just run a
>> massive availMem deficit, that you have no way of paying back, because
>> you can't "liquidate assets to pay off your creditors" (e.g., release
>> a bit of the memtuples memory). The fact is that memtuples growth
>> doesn't work that way. The memtuples array never shrinks.
>
>
> Hmm. But memtuples is empty, just after we have built the initial runs. Why
> couldn't we shrink, i.e. free and reallocate, it?

After we've built the initial runs, we do in fact give a FREEMEM()
refund to those tapes that were not used within beginmerge(), as I
mentioned just now (with a high workMem, this is often the great
majority of many thousands of logical tapes -- that's how you get to
wasting 8% of 5GB of maintenance_work_mem).
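
To make the arithmetic concrete (roughly, using the current constants
with BLCKSZ = 8kB, i.e. TAPE_BUFFER_OVERHEAD = 24kB and
MERGE_BUFFER_SIZE = 256kB):

    maxTapes ~= 5GB / (256kB + 24kB) ~= 18,700 tapes
    overhead ~= 18,700 * 24kB        ~= 440MB

which is where the roughly 8% figure comes from.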

What's at issue with this 500 tapes cap patch is what happens after
tuples are first dumped (after we decide that this is going to be an
external sort -- where we call tuplesort_merge_order() to get the
number of logical tapes in the tapeset), but before the final merge
happens, where we're already doing the right thing for merging by
giving that refund. I want to stop logical allocation (USEMEM()) of an
enormous number of tapes, to make run generation itself able to use
more memory.

It's surprisingly difficult to do something cleverer than just impose a cap.

-- 
Peter Geoghegan



Re: Parallel tuplesort (for parallel B-Tree index creation)

From
Heikki Linnakangas
Date:
On 09/07/2016 09:17 AM, Peter Geoghegan wrote:
> On Tue, Sep 6, 2016 at 11:09 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>> The big picture here is that you can't only USEMEM() for tapes as the
>>> need arises for new tapes as new runs are created. You'll just run a
>>> massive availMem deficit, that you have no way of paying back, because
>>> you can't "liquidate assets to pay off your creditors" (e.g., release
>>> a bit of the memtuples memory). The fact is that memtuples growth
>>> doesn't work that way. The memtuples array never shrinks.
>>
>>
>> Hmm. But memtuples is empty, just after we have built the initial runs. Why
>> couldn't we shrink, i.e. free and reallocate, it?
>
> After we've built the initial runs, we do in fact give a FREEMEM()
> refund to those tapes that were not used within beginmerge(), as I
> mentioned just now (with a high workMem, this is often the great
> majority of many thousands of logical tapes -- that's how you get to
> wasting 8% of 5GB of maintenance_work_mem).

Peter and I chatted over IM about this. Let me try to summarize the problems, 
and my plan:

1. When we start to build the initial runs, we currently reserve memory 
for tape buffers, maxTapes * TAPE_BUFFER_OVERHEAD. But we only actually 
need the buffers for tapes that are really used. We "refund" the buffers 
for the unused tapes after we've built the initial runs, but we're still 
wasting that while building the initial runs. We didn't actually 
allocate it, but we could've used it for other things. Peter's solution 
to this was to put a cap on maxTapes.

2. My observation is that during the build-runs phase, you only actually 
need those tape buffers for the one tape you're currently writing to. 
When you switch to a different tape, you could flush and free the 
buffers for the old tape. So reserving maxTapes * TAPE_BUFFER_OVERHEAD 
is excessive, 1 * TAPE_BUFFER_OVERHEAD would be enough. logtape.c 
doesn't have an interface for doing that today, but it wouldn't be hard 
to add.

3. If we do that, we'll still have to reserve the tape buffers for all 
the tapes that we use during merge. So after we've built the initial 
runs, we'll need to reserve memory for those buffers. That might require 
shrinking memtuples. But that's OK: after building the initial runs, 
memtuples is empty, so we can shrink it.
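
For point 2, I'm thinking of something like this hypothetical addition
to logtape.c (no such function exists today; the name is invented):

    /*
     * Flush and release the in-memory buffers of a tape that is not
     * currently being read or written, keeping just enough state
     * (e.g. the block number of the topmost indirect block) to
     * resume reading or writing later.
     */
    extern void LogicalTapePause(LogicalTapeSet *lts, int tapenum);

Run building would then pause the previous tape whenever it switches
to a new one, so only one tape's buffers are ever held at a time.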

- Heikki



Re: Parallel tuplesort (for parallel B-Tree index creation)

From
Claudio Freire
Date:
On Tue, Sep 6, 2016 at 8:28 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> In the meanwhile, I'll go and do some perf testing.
>>
>> Assuming the speedup is realized during testing, LGTM.
>
> Thanks. I suggest spending at least as much time on unsympathetic
> cases (e.g., only 2 or 3 tapes must be merged). At the same time, I
> suggest focusing on a type that has relatively expensive comparisons,
> such as collated text, to make differences clearer.

The tests are still running (the benchmark script I came up with runs
for a lot longer than I anticipated, about 2 days), but preliminary
results are very promising; I can see a clear and consistent speedup.
We'll have to wait for the complete results to see if there's any
significant regression, though. I'll post the full results when I have
them, but so far it all looks like this:

setup:

create table lotsofitext(i text, j text, w text, z integer, z2 bigint);
insert into lotsofitext select cast(random() * 1000000000.0 as text)
|| 'blablablawiiiiblabla', cast(random() * 1000000000.0 as text) ||
'blablablawjjjblabla', cast(random() * 1000000000.0 as text) ||
'blablablawwwabla', random() * 1000000000.0, random() * 1000000000000.0 from
generate_series(1, 10000000);

timed:

select count(*) FROM (select * from lotsofitext order by i, j, w, z, z2) t;

Unpatched Time: 100351.251 ms
Patched Time: 75180.787 ms

That's like a 25% speedup on random input. As we say over here, rather
badly translated, not a turkey's boogers (meaning "nice!")


On Tue, Sep 6, 2016 at 9:50 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> On Tue, Sep 6, 2016 at 9:19 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> On Tue, Sep 6, 2016 at 4:55 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>>> I noticed, but here n = state->memtupcount
>>>
>>> +       Assert(memtuples[0].tupindex == newtup->tupindex);
>>> +
>>> +       CHECK_FOR_INTERRUPTS();
>>> +
>>> +       n = state->memtupcount;    /* n is heap's size, including old root */
>>> +       imin = 0;                  /* start with caller's "hole" in root */
>>> +       i = imin;
>>
>> I'm fine with using "n" in the later assertion you mentioned, if
>> that's clearer to you. memtupcount is broken out as "n" simply because
>> that's less verbose, in a place where that makes things far clearer.
>>
>>> In fact, the assert on the patch would allow writing memtuples outside
>>> the heap, as in calling tuplesort_heap_root_displace if
>>> memtupcount==0, but I don't think that should be legal (memtuples[0]
>>> == memtuples[imin] would be outside the heap).
>>
>> You have to have a valid heap (i.e. there must be at least one
>> element) to call tuplesort_heap_root_displace(), and it doesn't
>> directly compact the heap, so it must remain valid on return. The
>> assertion exists to make sure that everything is okay with a
>> one-element heap, a case which is quite possible.
>
> More than using "n" or "memtupcount" what I'm saying is to assert that
> memtuples[imin] is inside the heap, which would catch the same errors
> the original assert would, and more.
>
> Assert(imin < state->memtupcount)
>
> If you prefer.
>
> The original asserts allows any value of imin for memtupcount>1, and
> that's my main concern. It shouldn't.

So, for the assertions to properly avoid clobbering/reading out of
bounds memory, you need both the above assert:
 +                */
 +               memtuples[i] = memtuples[imin];
 +               i = imin;
 +       }
 +
>+       Assert(imin < state->memtupcount);
 +       memtuples[imin] = *newtup;
 +}

And another one at the beginning, asserting:

 +       SortTuple  *memtuples = state->memtuples;
 +       int             n,
 +                               imin,
 +                               i;
 +
>+       Assert(state->memtupcount > 0 && memtuples[0].tupindex == newtup->tupindex);
 +
 +       CHECK_FOR_INTERRUPTS();

It's worth making that change, IMHO, unless I'm missing something.



Re: Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Thu, Sep 8, 2016 at 8:53 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
> setup:
>
> create table lotsofitext(i text, j text, w text, z integer, z2 bigint);
> insert into lotsofitext select cast(random() * 1000000000.0 as text)
> || 'blablablawiiiiblabla', cast(random() * 1000000000.0 as text) ||
> 'blablablawjjjblabla', cast(random() * 1000000000.0 as text) ||
> 'blablablawwwabla', random() * 1000000000.0, random() * 1000000000000.0 from
> generate_series(1, 10000000);
>
> timed:
>
> select count(*) FROM (select * from lotsofitext order by i, j, w, z, z2) t;
>
> Unpatched Time: 100351.251 ms
> Patched Time: 75180.787 ms
>
> That's like a 25% speedup on random input. As we say over here, rather
> badly translated, not a turkey's boogers (meaning "nice!")

Cool! What work_mem setting were you using here?

>> More than using "n" or "memtupcount" what I'm saying is to assert that
>> memtuples[imin] is inside the heap, which would catch the same errors
>> the original assert would, and more.
>>
>> Assert(imin < state->memtupcount)
>>
>> If you prefer.
>>
>> The original asserts allows any value of imin for memtupcount>1, and
>> that's my main concern. It shouldn't.
>
> So, for the assertions to properly avoid clobbering/reading out of
> bounds memory, you need both the above assert:
>
>  +                */
>  +               memtuples[i] = memtuples[imin];
>  +               i = imin;
>  +       }
>  +
>>+       Assert(imin < state->memtupcount);
>  +       memtuples[imin] = *newtup;
>  +}
>
> And another one at the beginning, asserting:
>
>  +       SortTuple  *memtuples = state->memtuples;
>  +       int             n,
>  +                               imin,
>  +                               i;
>  +
>>+       Assert(state->memtupcount > 0 && memtuples[0].tupindex == newtup->tupindex);
>  +
>  +       CHECK_FOR_INTERRUPTS();
>
> It's worth making that change, IMHO, unless I'm missing something.

You're supposed to just not call it with an empty heap, so the
assertions trust that much. I'll look into that.

I'm currently producing a new revision of this entire patchset.
Improving the cost model (used when the parallel_workers storage
parameter is not specified within CREATE INDEX) is taking a bit of
time, but I hope to have it out in the next couple of days.

-- 
Peter Geoghegan



Re: Parallel tuplesort (for parallel B-Tree index creation)

From
Claudio Freire
Date:
On Thu, Sep 8, 2016 at 2:13 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Thu, Sep 8, 2016 at 8:53 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> setup:
>>
>> create table lotsofitext(i text, j text, w text, z integer, z2 bigint);
>> insert into lotsofitext select cast(random() * 1000000000.0 as text)
>> || 'blablablawiiiiblabla', cast(random() * 1000000000.0 as text) ||
>> 'blablablawjjjblabla', cast(random() * 1000000000.0 as text) ||
>> 'blablablawwwabla', random() * 1000000000.0, random() * 1000000000000.0 from
>> generate_series(1, 10000000);
>>
>> timed:
>>
>> select count(*) FROM (select * from lotsofitext order by i, j, w, z, z2) t;
>>
>> Unpatched Time: 100351.251 ms
>> Patched Time: 75180.787 ms
>>
>> That's like a 25% speedup on random input. As we say over here, rather
>> badly translated, not a turkey's boogers (meaning "nice!")
>
> Cool! What work_mem setting were you using here?

The script iterates over a few variations of string patterns (easy
comparisons vs hard comparisons), work_mem (4MB, 64MB, 256MB, 1GB,
4GB), and table sizes (~350M, ~650M, ~1.5G).

That particular case I believe is using work_mem=4MB, easy strings, 1.5GB table.



Re: Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Thu, Sep 8, 2016 at 10:18 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
> That particular case I believe is using work_mem=4MB, easy strings, 1.5GB table.

Cool. I wonder where this leaves Heikki's draft patch, that completely
removes batch memory, etc.



-- 
Peter Geoghegan



Re: Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Wed, Sep 7, 2016 at 2:36 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> 3. If we do that, we'll still have to reserve the tape buffers for all the
> tapes that we use during merge. So after we've built the initial runs, we'll
> need to reserve memory for those buffers. That might require shrinking
> memtuples. But that's OK: after building the initial runs, memtuples is
> empty, so we can shrink it.

Do you really think all this is worth the effort? Given how things are
going to improve for merging anyway, I tend to doubt it. I'd rather
just apply the cap (not necessarily 501 tapes, but something), and be
done with it. As you know, Knuth never advocated more than 7 tapes at
once, which I don't think had anything to do with the economics of
tape drives in the 1970s (or problems with tape operators getting
repetitive strain injuries). There is a chart in volume 3 about this.
Senior hackers talked about a cap like this from day one, back in
2006, when Simon and Tom initially worked on scaling the number of
tapes. Alternatively, we could make MERGE_BUFFER_SIZE much larger,
which I think would be a good idea independent of whatever waste
logical allocation of never-used tapes presents us with. It's
currently 1/4 of 1MiB, which is hardly anything these days, and
doesn't seem to have much to do with OS read ahead trigger sizes.
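
For reference, the constants in question are currently:

    #define TAPE_BUFFER_OVERHEAD    (BLCKSZ * 3)    /* 24kB by default */
    #define MERGE_BUFFER_SIZE       (BLCKSZ * 32)   /* 256kB by default */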

If we were going to do something like you describe here, I'd prefer it
to be driven by an observable benefit in performance, rather than a
theoretical benefit. Not doing everything in one pass isn't
necessarily worse than having a less cache efficient heap -- it might
be quite a bit better, in fact. You've seen how hard it can be to get
a sort that is I/O bound. (Sorting will tend to not be completely I/O
bound, unless perhaps parallelism is used).

Anyway, this patch (patch 0001-*) is by far the least important of the
3 that you and Claudio are signed up to review. I don't think it's
worth bending over backwards to do better. If you're not comfortable
with a simple cap like this, than I'd suggest that we leave it at
that, since our time is better spent elsewhere. We can just shelve it
for now -- "returned with feedback". I wouldn't make any noise about
it (although, I actually don't think that the cap idea is at all
controversial).

-- 
Peter Geoghegan



Re: Parallel tuplesort (for parallel B-Tree index creation)

From
Claudio Freire
Date:
On Thu, Sep 8, 2016 at 2:18 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> On Thu, Sep 8, 2016 at 2:13 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> On Thu, Sep 8, 2016 at 8:53 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
>>> setup:
>>>
>>> create table lotsofitext(i text, j text, w text, z integer, z2 bigint);
>>> insert into lotsofitext select cast(random() * 1000000000.0 as text)
>>> || 'blablablawiiiiblabla', cast(random() * 1000000000.0 as text) ||
>>> 'blablablawjjjblabla', cast(random() * 1000000000.0 as text) ||
>>> 'blablablawwwabla', random() * 1000000000.0, random() * 1000000000000.0 from
>>> generate_series(1, 10000000);
>>>
>>> timed:
>>>
>>> select count(*) FROM (select * from lotsofitext order by i, j, w, z, z2) t;
>>>
>>> Unpatched Time: 100351.251 ms
>>> Patched Time: 75180.787 ms
>>>
>>> That's like a 25% speedup on random input. As we say over here, rather
>>> badly translated, not a turkey's boogers (meaning "nice!")
>>
>> Cool! What work_mem setting were you using here?
>
> The script iterates over a few variations of string patterns (easy
> comparisons vs hard comparisons), work mem (4MB, 64MB, 256MB, 1GB,
> 4GB), and table sizes (~350M, ~650M, ~1.5G).
>
> That particular case I believe is using work_mem=4MB, easy strings, 1.5GB table.

Well, the worst regression I see is under the noise for this test
(which seems rather high at 5%, but that's to be expected since it's
mostly big queries).

Most samples show an improvement, either marginal or significant. The
most improvement is, naturally, on low work_mem settings. I don't see
significant slowdown on work_mem settings that should result in just a
few tapes being merged, but I didn't instrument to check how many
tapes were being merged in any case.

Attached are the results both in ods, csv and raw formats.

I think these are good results.


So, to summarize the review:

- Patch seems to follow the coding conventions of surrounding code
- Applies cleanly on top of 25794e841e5b86a0f90fac7f7f851e5d950e51e2,
plus patches 1 and 2.
- Builds without warnings
- Passes regression tests
- IMO has sufficient coverage from existing tests (none added)
- Does not introduce any significant performance regression
- Best improvement of 67% (reduction of runtime to 59%)
- Average improvement of 30% (reduction of runtime to 77%)
- Worst regression of 5% (increase of runtime to 105%), which is under
the noise for control queries, so not significant
- Performance improvement is highly desirable in this merge
step, as it's a big bottleneck in parallel sort (and seemingly in
regular sort as well)
- All testing was done on random input, presorted input *will* show
more pronounced improvements

I suggested changing a few asserts in tuplesort_heap_root_displace to
make the debug code stricter in checking the assumptions, but they're
not blockers:

+       Assert(state->memtupcount > 1 || imin == 0);
+       memtuples[imin] = *newtup;

Into

+       Assert(imin < state->memtupcount);
+       memtuples[imin] = *newtup;

And, perhaps as well,

+       Assert(memtuples[0].tupindex == newtup->tupindex);
+
+       CHECK_FOR_INTERRUPTS();

into

+       Assert(state->memtupcount > 0 && memtuples[0].tupindex ==
newtup->tupindex);
+
+       CHECK_FOR_INTERRUPTS();


It was suggested that both tuplesort_heap_siftup and
tuplesort_heap_root_displace could be wrappers around a common
"siftup" implementation, since the underlying operation is very
similar.

Since it is true that doing so would make it impossible to keep the
asserts about tupindex in tuplesort_heap_root_displace, I guess it
depends on how useful those asserts are (ie: how likely it is that
those conditions could be violated, and how damaging it could be if
they were). If it is decided the refactor is desirable, I'd suggest
making the common siftup procedure static inline, to allow
tuplesort_heap_root_displace to inline and specialize it, since it
will be called with checkIndex=False and that simplifies the resulting
code considerably.

Peter also mentioned that there were some other changes going on in
the surrounding code that could impact this patch, so I'm marking the
patch Waiting on Author.

Overall, however, I believe the patch is in good shape. Only minor
form issues need to be changed, the functionality seems both desirable
and ready.

Attachment

Re: Parallel tuplesort (for parallel B-Tree index creation)

From
Claudio Freire
Date:
...
On Fri, Sep 9, 2016 at 9:22 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> Since it is true that doing so would make it impossible to keep the
> asserts about tupindex in tuplesort_heap_root_displace, I guess it
> depends on how useful those asserts are (ie: how likely it is that
> those conditions could be violated, and how damaging it could be if
> they were). If it is decided the refactor is desirable, I'd suggest
> making the common siftup procedure static inline, to allow
> tuplesort_heap_root_displace to inline and specialize it, since it
> will be called with checkIndex=False and that simplifies the resulting
> code considerably.
>
> Peter also mentioned that there were some other changes going on in
> the surrounding code that could impact this patch, so I'm marking the
> patch Waiting on Author.
>
> Overall, however, I believe the patch is in good shape. Only minor
> form issues need to be changed, the functionality seems both desirable
> and ready.


Sorry, forgot to specify, that was all about patch 3, the one about
tuplesort_heap_root_displace.



Re: Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Fri, Sep 9, 2016 at 5:22 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> Since it is true that doing so would make it impossible to keep the
> asserts about tupindex in tuplesort_heap_root_displace, I guess it
> depends on how useful those asserts are (ie: how likely it is that
> those conditions could be violated, and how damaging it could be if
> they were). If it is decided the refactor is desirable, I'd suggest
> making the common siftup procedure static inline, to allow
> tuplesort_heap_root_displace to inline and specialize it, since it
> will be called with checkIndex=False and that simplifies the resulting
> code considerably.

Right. I want to keep it as a separate function for all these reasons.
I also think that I'll end up further optimizing what I've called
tuplesort_heap_root_displace in the future, to adapt to clustered
input. I'm thinking of something like Timsort's "galloping mode". What
I've come up with here still needs 2 comparisons and a swap per call
for presorted input. There is still a missed opportunity for clustered
or (inverse) correlated input -- we can make merging opportunistically
skip ahead to determine that the root tape's 100th tuple (say) would
still fit in the root position of the merge minheap. So, immediately
return 100 tuples from the root's tape without bothering to compare
them to anything. Do a binary search to find the best candidate
minheap root before the 100th tuple if a guess of 100 doesn't work
out. Adapt to trends. Stuff like that.
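
In pseudo-C, the idea is something like this (every helper named here
is hypothetical; this only illustrates the shape of it):

    /*
     * Before doing any heap maintenance, peek ahead on the root's
     * tape: as long as its next tuple would stay at the root anyway
     * (it does not sort after the current runner-up), return tuples
     * straight off that tape, one comparison each, no sifting.
     */
    SortTuple  *runnerup = heap_runner_up(state);       /* hypothetical */

    while (tape_has_next(root_tape) &&
           COMPARETUP(state, tape_peek(root_tape), runnerup) <= 0)
        emit_tuple(state, tape_next(root_tape));        /* hypothetical */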

-- 
Peter Geoghegan



Re: Parallel tuplesort (for parallel B-Tree index creation)

From
Heikki Linnakangas
Date:
On 09/10/2016 03:22 AM, Claudio Freire wrote:
> Overall, however, I believe the patch is in good shape. Only minor
> form issues need to be changed, the functionality seems both desirable
> and ready.

Pushed this "displace root" patch, with some changes:

* I renamed "tuplesort_heap_siftup()" to "tuplesort_delete_top()". I 
realize that this is controversial, per the discussion on the "Is 
tuplesort_heap_siftup() a misnomer?" thread. However, now that we have a 
new function, "tuplesort_heap_replace_top()", which is exactly the same 
algorithm as the "delete_top()" algorithm, calling one of them "siftup" 
became just too confusing. If anything, the new "replace_top" 
corresponds more closely to Knuth's siftup algorithm; delete-top is a 
special case of it. I added a comment on that to replace_top. I hope 
everyone can live with this.

* Instead of "root_displace", I used the name "replace_top", and 
"delete_top" for the old siftup function. Because we use "top" to refer 
to memtuples[0] more commonly than "root", in the existing comments.

* I shared the code between the delete-top and replace-top. Delete-top 
now calls the replace-top function, with the last element of the heap. 
Both functions have the same signature, i.e. they both take the 
checkIndex argument. Peter's patch left that out for the "replace" 
function, on performance grounds, but if that's worthwhile, that seems 
like a separate optimization. Might be worth benchmarking that 
separately, but I didn't want to conflate that with this patch. (See 
the sketch below for roughly how the sharing works.)

* I replaced a few more siftup+insert calls with the new combined 
replace-top operation. Because why not.
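
In code form, the sharing amounts to roughly this (simplified from
what I pushed):

    static void
    tuplesort_heap_delete_top(Tuplesortstate *state, bool checkIndex)
    {
        SortTuple  *memtuples = state->memtuples;

        if (--state->memtupcount <= 0)
            return;             /* heap is now empty */

        /*
         * Remove the last element from the heap, and use it to fill
         * the hole left at the root: that is exactly a replace-top
         * operation.
         */
        tuplesort_heap_replace_top(state,
                                   &memtuples[state->memtupcount],
                                   checkIndex);
    }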

Thanks for the patch, Peter, and thanks for the review, Claudio!

- Heikki




Re: Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Sun, Sep 11, 2016 at 6:28 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> * I renamed "tuplesort_heap_siftup()" to "tuplesort_delete_top()". I realize
> that this is controversial, per the discussion on the "Is
> tuplesort_heap_siftup() a misnomer?" thread. However, now that we have a new
> function, "tuplesort_heap_replace_top()", which is exactly the same
> algorithm as the "delete_top()" algorithm, calling one of them "siftup"
> became just too confusing.

I feel pretty strongly that this was the correct decision. I would
have gone further, and removed any mention of "Sift up", but you can't
win them all.

> * Instead of "root_displace", I used the name "replace_top", and
> "delete_top" for the old siftup function. Because we use "top" to refer to
> memtuples[0] more commonly than "root", in the existing comments.

Fine by me.

> * I shared the code between the delete-top and replace-top. Delete-top now
> calls the replace-top function, with the last element of the heap. Both
> functions have the same signature, i.e. they both take the checkIndex
> argument. Peter's patch left that out for the "replace" function, on
> performance grounds, but if that's worthwhile, that seems like a separate
> optimization. Might be worth benchmarking that separately, but I didn't want
> to conflate that with this patch.

Okay.

> * I replaced a few more siftup+insert calls with the new combined
> replace-top operation. Because why not.

I suppose that the consistency has value, from a code clarity standpoint.

> Thanks for the patch, Peter, and thanks for the review, Claudio!

Thanks Heikki!

-- 
Peter Geoghegan



Re: Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Sun, Sep 11, 2016 at 6:28 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Pushed this "displace root" patch, with some changes:

Attached is a rebased version of the entire patch series, which should
be applied on top of what you pushed to the master branch today.

This features a new scheme for managing workMem --
maintenance_work_mem is now treated as a high watermark/budget for the
entire CREATE INDEX operation, regardless of the number of workers.
This seems to work much better, so Robert was right to suggest it.

There were also improvements to the cost model, to weigh available
maintenance_work_mem under this new system. And, the cost model was
moved inside planner.c (next to plan_cluster_use_sort()), which is
really where it belongs. The cost model is still WIP, though, and I
didn't address some concerns of my own about how tuplesort.c
coordinates workers. I think that Robert's "condition variables" will
end up superseding that stuff anyway. And, I think that this v2 will
bitrot fairly soon, when Heikki commits what is in effect his version
of my 0002-* patch (that's unchanged, if only because it refactors
some things that the parallel CREATE INDEX patch is reliant on).

So, while there are still a few loose ends with this revision (it
should still certainly be considered WIP), I wanted to get a revision
out quickly because V1 has been left to bitrot for too long now, and
my schedule is very full for the next week, ahead of my leaving to go
on vacation (which is long overdue). Hopefully, I'll be able to get
out a third revision next Saturday, on top of the
by-then-presumably-committed new tape batch memory patch from Heikki,
just before I leave. I'd rather leave with a patch available that can
be cleanly applied, to make review as easy as possible, since it
wouldn't be great to have this V2 with bitrot for 10 days or more.

--
Peter Geoghegan

Attachment

Re: Parallel tuplesort (for parallel B-Tree index creation)

From
Robert Haas
Date:
On Sun, Sep 11, 2016 at 2:05 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Sun, Sep 11, 2016 at 6:28 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> Pushed this "displace root" patch, with some changes:
>
> Attached is rebased version of the entire patch series, which should
> be applied on top of what you pushed to the master branch today.

0003 looks like a sensible cleanup of our #include structure
regardless of anything this patch series is trying to accomplish, so
I've committed it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Parallel tuplesort (for parallel B-Tree index creation)

From
Heikki Linnakangas
Date:
On 08/02/2016 01:18 AM, Peter Geoghegan wrote:
> Tape unification
> ----------------
>
> Sort operations have a unique identifier, generated before any workers
> are launched, using a scheme based on the leader's PID, and a unique
> temp file number. This makes all on-disk state (temp files managed by
> logtape.c) discoverable by the leader process. State in shared memory
> is sized in proportion to the number of workers, so the only thing
> about the data being sorted that gets passed around in shared memory
> is a little logtape.c metadata for tapes, describing for example how
> large each constituent BufFile is (a BufFile associated with one
> particular worker's tapeset).
>
> (See below also for notes on buffile.c's role in all of this, fd.c and
> resource management, etc.)
>
> ...
>
> buffile.c, and "unification"
> ============================
>
> There has been significant new infrastructure added to make logtape.c
> aware of workers. buffile.c has in turn been taught about unification
> as a first class part of the abstraction, with low-level management of
> certain details occurring within fd.c. So, "tape unification" within
> processes to open other backends' logical tapes to generate a unified
> logical tapeset for the leader to merge is added. This is probably the
> single biggest source of complexity for the patch, since I must
> consider:
>
> * Creating a general, reusable abstraction for other possible BufFile
> users (logtape.c only has to serve tuplesort.c, though).
>
> * Logical tape free space management.
>
> * Resource management, file lifetime, etc. fd.c resource management
> can now close a file at xact end for temp files, while not deleting it
> in the leader backend (only the "owning" worker backend deletes the
> temp file it owns).
>
> * Crash safety (e.g., when to truncate existing temp files, and when not to).

I find this unification business really complicated. I think it'd be
simpler to keep the BufFiles and LogicalTapeSets separate, and instead
teach tuplesort.c how to merge tapes that live on different
LogicalTapeSets/BufFiles. Or refactor LogicalTapeSet so that a single
LogicalTapeSet can contain tapes from different underlying BufFiles.

What I have in mind is something like the attached patch. It refactors
LogicalTapeRead(), LogicalTapeWrite() etc. functions to take a
LogicalTape as argument, instead of LogicalTapeSet and tape number.
LogicalTapeSet doesn't have the concept of a tape number anymore, it can
contain any number of tapes, and you can create more on the fly. With
that, it'd be fairly easy to make tuplesort.c merge LogicalTapes that
came from different tape sets, backed by different BufFiles. I think
that'd avoid much of the unification code.
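
In other words, something like this (a rough sketch of the interface
only; these signatures approximate the description above rather than
quoting the attached patch):

typedef struct LogicalTape LogicalTape;   /* opaque; owned by a tape set */

/* create a new tape within a tape set, on the fly */
extern LogicalTape *LogicalTapeCreate(LogicalTapeSet *lts);

/* read/write/rewind take the tape itself, not (set, tape number) */
extern size_t LogicalTapeRead(LogicalTape *lt, void *ptr, size_t size);
extern void LogicalTapeWrite(LogicalTape *lt, void *ptr, size_t size);
extern void LogicalTapeRewindForRead(LogicalTape *lt, size_t buffer_size);

With tapes as first-class objects, the merge code in tuplesort.c no
longer needs to care which tape set (and hence which BufFile) a given
input tape came from.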

That leaves one problem, though: reusing space in the final merge phase.
If the tapes being merged belong to different LogicalTapeSets, and you
create one new tape to hold the result, the new tape cannot easily reuse
the space of the input tapes because they are on different tape sets.
But looking at your patch, ISTM you actually dodged that problem as well:

> +     * As a consequence of only being permitted to write to the leader
> +     * controlled range, parallel sorts that require a final materialized tape
> +     * will use approximately twice the disk space for temp files compared to
> +     * a more or less equivalent serial sort.  This is deemed acceptable,
> +     * since it is far rarer in practice for parallel sort operations to
> +     * require a final materialized output tape.  Note that this does not
> +     * apply to any merge process required by workers, which may reuse space
> +     * eagerly, just like conventional serial external sorts, and so
> +     * typically, parallel sorts consume approximately the same amount of disk
> +     * blocks as a more or less equivalent serial sort, even when workers must
> +     * perform some merging to produce input to the leader.

I'm slightly worried about that. Maybe it's OK for a first version, but
it'd be annoying in a query where a sort is below a merge join, for
example, so that you can't do the final merge on the fly because
mark/restore support is needed.

One way to fix that would be to have all the parallel workers share the
work files to begin with, and keep the "nFileBlocks" value in shared
memory so that the workers won't overlap each other. Then all the blocks
from different workers would be mixed together, though, which would hurt
the sequential pattern of the tapes, so each worker would need to
allocate larger chunks to avoid that.
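
Something along these lines, I mean (a sketch with made-up names, just
to show the shared allocation):

#include "port/atomics.h"

#define TAPE_CHUNK_BLOCKS 64    /* allocate in large chunks, so that each
                                 * worker's writes stay mostly sequential */

typedef struct SharedTapeSpace
{
    pg_atomic_uint64 nFileBlocks;   /* next unallocated block in the
                                     * shared work file */
} SharedTapeSpace;

/* returns the first block of a private TAPE_CHUNK_BLOCKS-sized range */
static uint64
tape_alloc_chunk(SharedTapeSpace *shared)
{
    return pg_atomic_fetch_add_u64(&shared->nFileBlocks, TAPE_CHUNK_BLOCKS);
}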

- Heikki


Attachment

Re: Parallel tuplesort (for parallel B-Tree index creation)

From
Heikki Linnakangas
Date:
On 08/02/2016 01:18 AM, Peter Geoghegan wrote:
> No merging in parallel
> ----------------------
>
> Currently, merging worker *output* runs may only occur in the leader
> process. In other words, we always keep n worker processes busy with
> scanning-and-sorting (and maybe some merging), but then all processes
> but the leader process grind to a halt (note that the leader process
> can participate as a scan-and-sort tuplesort worker, just as it will
> everywhere else, which is why I specified "parallel_workers = 7" but
> talked about 8 workers).
>
> One leader process is kept busy with merging these n output runs on
> the fly, so things will bottleneck on that, which you saw in the
> example above. As already described, workers will sometimes merge in
> parallel, but only their own runs -- never another worker's runs. I
> did attempt to address the leader merge bottleneck by implementing
> cross-worker run merging in workers. I got as far as implementing a
> very rough version of this, but initial results were disappointing,
> and so that was not pursued further than the experimentation stage.
>
> Parallel merging is a possible future improvement that could be added
> to what I've come up with, but I don't think that it will move the
> needle in a really noticeable way.

It'd be good if you could overlap the final merges in the workers with 
the merge in the leader. ISTM it would be quite straightforward to 
replace the final tape of each worker with a shared memory queue, so 
that the leader could start merging and returning tuples as soon as it 
gets the first tuple from each worker. Instead of having to wait for all 
the workers to complete first.
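
The worker side of that idea might look roughly like this (a sketch
only; serialize_sorttuple() is a hypothetical helper, and error
handling is elided):

static void
worker_emit_tuple(shm_mq_handle *mqh, SortTuple *stup)
{
    Size        len;
    char       *buf = serialize_sorttuple(stup, &len);   /* hypothetical */

    /* blocks when the queue is full, i.e. the leader has fallen behind */
    if (shm_mq_send(mqh, len, buf, false) != SHM_MQ_SUCCESS)
        elog(ERROR, "leader detached from tuple queue");
    pfree(buf);
}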

- Heikki




Re: Parallel tuplesort (for parallel B-Tree index creation)

From
Robert Haas
Date:
On Thu, Sep 22, 2016 at 3:51 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> It'd be good if you could overlap the final merges in the workers with the
> merge in the leader. ISTM it would be quite straightforward to replace the
> final tape of each worker with a shared memory queue, so that the leader
> could start merging and returning tuples as soon as it gets the first tuple
> from each worker. Instead of having to wait for all the workers to complete
> first.

If you do that, make sure to have the leader read multiple tuples at a
time from each worker whenever possible.  It makes a huge difference
to performance.  See bc7fcab5e36b9597857fa7e3fa6d9ba54aaea167.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Wed, Sep 21, 2016 at 5:52 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> I find this unification business really complicated.

I can certainly understand why you would. As I said, it's the most
complicated part of the patch, which overall is one of the most
ambitious patches I've ever written.

> I think it'd be simpler
> to keep the BufFiles and LogicalTapeSets separate, and instead teach
> tuplesort.c how to merge tapes that live on different
> LogicalTapeSets/BufFiles. Or refactor LogicalTapeSet so that a single
> LogicalTapeSet can contain tapes from different underlying BufFiles.
>
> What I have in mind is something like the attached patch. It refactors
> LogicalTapeRead(), LogicalTapeWrite() etc. functions to take a LogicalTape
> as argument, instead of LogicalTapeSet and tape number. LogicalTapeSet
> doesn't have the concept of a tape number anymore, it can contain any number
> of tapes, and you can create more on the fly. With that, it'd be fairly easy
> to make tuplesort.c merge LogicalTapes that came from different tape sets,
> backed by different BufFiles. I think that'd avoid much of the unification
> code.

I think that it won't be possible to make a LogicalTapeSet ever use
more than one BufFile without regressing the ability to eagerly reuse
space, which is almost the entire reason for logtape.c existing. The
whole indirect block thing is an idea borrowed from the FS world, of
course, and so logtape.c needs one block-device-like BufFile, with
blocks that can be reclaimed eagerly, but consumed for recycling in
*contiguous* order (which is why they're sorted using qsort() within
ltsGetFreeBlock()). You're going to increase the amount of random I/O
by using more than one BufFile for an entire tapeset, I think.
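
To make that concrete, this is roughly what block recycling looks like
today (simplified from logtape.c; freeBlocks_cmp sorts the array in
descending order, so that popping from the end consumes free blocks
lowest-first, keeping reuse as contiguous as possible):

static long
ltsGetFreeBlock(LogicalTapeSet *lts)
{
    if (lts->nFreeBlocks > 0)
    {
        if (!lts->blocksSorted)
        {
            qsort(lts->freeBlocks, lts->nFreeBlocks,
                  sizeof(long), freeBlocks_cmp);
            lts->blocksSorted = true;
        }
        return lts->freeBlocks[--lts->nFreeBlocks];
    }

    /* no free blocks; extend the underlying BufFile */
    return lts->nFileBlocks++;
}

None of that works if a tape set spans more than one BufFile, since a
block number is only meaningful relative to a single file.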

This patch you posted
("0001-Refactor-LogicalTapeSet-LogicalTape-interface.patch") just
keeps one BufFile, and only changes the interface to expose the tapes
themselves to tuplesort.c, without actually making tuplesort.c do
anything with that capability. I see what you're getting at, I think,
but I don't see how that accomplishes all that much for parallel
CREATE INDEX. I mean, the special case of having multiple tapesets
from workers (not one "unified" tapeset created from worker temp files
from their tapesets to begin with) now needs special treatment.
Haven't you just moved the complexity around (once your patch is made
to care about parallelism)? Having multiple entire tapesets explicitly
from workers, with their own BufFiles, is not clearly less complicated
than managing BufFile fd.c files with delineated ranges of
"logical tapeset space". Seems almost equivalent, except that my way
doesn't bother tuplesort.c with any of this.

>> +        * As a consequence of only being permitted to write to the leader
>> +        * controlled range, parallel sorts that require a final
>> materialized tape
>> +        * will use approximately twice the disk space for temp files
>> compared to
>> +        * a more or less equivalent serial sort.

> I'm slightly worried about that. Maybe it's OK for a first version, but it'd
> be annoying in a query where a sort is below a merge join, for example, so
> that you can't do the final merge on the fly because mark/restore support is
> needed.

My intuition is that we'll *never* end up using this for merge joins.
I think that I could do better here (why should workers really care at
this point?), but just haven't bothered to.

This parallel sort implementation is something written with CREATE
INDEX and CLUSTER in mind only (maybe one or two other things, too). I
believe that for query execution, partitioning is the future [1]. With
merge joins, partitioning is desirable because it lets you push down
*everything* to workers, not just sorting (e.g., by aligning
partitioning boundaries on each side of each merge join sort in the
worker, and having the worker also "synchronize" each side of the
join, all independently and without a dependency on a final merge).

That's why I think it's okay that I use twice as much space for
randomAccess tuplesort.c callers. No real world caller will ever end
up needing to do this. It just seems like a good idea to support
randomAccess when using this new infrastructure, on general principle.
Forcing myself to support that case during initial development
actually resulted in much cleaner, less invasive changes to
tuplesort.c in general.

[1]
https://www.postgresql.org/message-id/flat/CAM3SWZR+ATYAzyMT+hm-Bo=1L1smtJbNDtibwBTKtYqS0dYZVg@mail.gmail.com#CAM3SWZR+ATYAzyMT+hm-Bo=1L1smtJbNDtibwBTKtYqS0dYZVg@mail.gmail.com
-- 
Peter Geoghegan



Re: Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Thu, Sep 22, 2016 at 8:57 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Sep 22, 2016 at 3:51 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> It'd be good if you could overlap the final merges in the workers with the
>> merge in the leader. ISTM it would be quite straightforward to replace the
>> final tape of each worker with a shared memory queue, so that the leader
>> could start merging and returning tuples as soon as it gets the first tuple
>> from each worker. Instead of having to wait for all the workers to complete
>> first.
>
> If you do that, make sure to have the leader read multiple tuples at a
> time from each worker whenever possible.  It makes a huge difference
> to performance.  See bc7fcab5e36b9597857fa7e3fa6d9ba54aaea167.

That requires some kind of mutual exclusion mechanism, like an LWLock.
It's possible that merging everything lazily is actually the faster
approach, given this, and given the likely bottleneck on I/O at this
stage. It's also certainly simpler to not overlap things. This is
something I've read about before [1], with "eager evaluation" sorting
not necessarily coming out ahead IIRC.

[1] http://digitalcommons.ohsu.edu/cgi/viewcontent.cgi?article=1193&context=csetech
-- 
Peter Geoghegan



Re: Parallel tuplesort (for parallel B-Tree index creation)

From
Robert Haas
Date:
On Sat, Sep 24, 2016 at 9:07 AM, Peter Geoghegan <pg@heroku.com> wrote:
> On Thu, Sep 22, 2016 at 8:57 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Thu, Sep 22, 2016 at 3:51 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>> It'd be good if you could overlap the final merges in the workers with the
>>> merge in the leader. ISTM it would be quite straightforward to replace the
>>> final tape of each worker with a shared memory queue, so that the leader
>>> could start merging and returning tuples as soon as it gets the first tuple
>>> from each worker. Instead of having to wait for all the workers to complete
>>> first.
>>
>> If you do that, make sure to have the leader read multiple tuples at a
>> time from each worker whenever possible.  It makes a huge difference
>> to performance.  See bc7fcab5e36b9597857fa7e3fa6d9ba54aaea167.
>
> That requires some kind of mutual exclusion mechanism, like an LWLock.

No, it doesn't.  Shared memory queues are single-reader, single-writer.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Mon, Sep 26, 2016 at 6:58 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> That requires some kind of mutual exclusion mechanism, like an LWLock.
>
> No, it doesn't.  Shared memory queues are single-reader, single-writer.

The point is that there is a natural dependency when merging is
performed eagerly within the leader. One thing needs to be in lockstep
with the others. That's all.


-- 
Peter Geoghegan



Re: Parallel tuplesort (for parallel B-Tree index creation)

From
Robert Haas
Date:
On Mon, Sep 26, 2016 at 3:40 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Mon, Sep 26, 2016 at 6:58 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> That requires some kind of mutual exclusion mechanism, like an LWLock.
>>
>> No, it doesn't.  Shared memory queues are single-reader, single-writer.
>
> The point is that there is a natural dependency when merging is
> performed eagerly within the leader. One thing needs to be in lockstep
> with the others. That's all.

I don't know what any of that means.  You said we need something like
an LWLock, but I think we don't.  The workers just write the results
of their own final merges into shm_mqs.  The leader can read from any
given shm_mq until no more tuples can be read without blocking, just
like nodeGather.c does, or at least it can do that unless its own
queue fills up first.  No mutual exclusion mechanism is required for
any of that, as far as I can see - not an LWLock, and not anything
similar.
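
In outline, and only in outline (merge_input_tuple() is a placeholder,
and the bookkeeping to stop polling detached queues is elided):

int         i = 0;
int         ndone = 0;

while (ndone < nworkers)
{
    void       *data;
    Size        nbytes;
    shm_mq_result res;

    /* nowait = true: take whatever is available without blocking */
    res = shm_mq_receive(workers[i].mqh, &nbytes, &data, true);
    if (res == SHM_MQ_SUCCESS)
        merge_input_tuple(i, data, nbytes);     /* placeholder */
    else if (res == SHM_MQ_WOULD_BLOCK)
        i = (i + 1) % nworkers;     /* try another queue; WaitLatch()
                                     * if every queue would block */
    else                            /* SHM_MQ_DETACHED: worker is done */
    {
        ndone++;
        i = (i + 1) % nworkers;
    }
}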

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Sun, Sep 11, 2016 at 11:05 AM, Peter Geoghegan <pg@heroku.com> wrote:
> So, while there are still a few loose ends with this revision (it
> should still certainly be considered WIP), I wanted to get a revision
> out quickly because V1 has been left to bitrot for too long now, and
> my schedule is very full for the next week, ahead of my leaving to go
> on vacation (which is long overdue). Hopefully, I'll be able to get
> out a third revision next Saturday, on top of the
> by-then-presumably-committed new tape batch memory patch from Heikki,
> just before I leave. I'd rather leave with a patch available that can
> be cleanly applied, to make review as easy as possible, since it
> wouldn't be great to have this V2 with bitrot for 10 days or more.

Heikki committed his preload memory patch a little later than
originally expected, 4 days ago. I attach V3 of my own parallel CREATE
INDEX patch, which should be applied on top of today's git master
(there is a bugfix that reviewers won't want to miss -- commit
b56fb691). I have my own test suite, and have to some extent used TDD
for this patch, so rebasing was not so bad. My tests are rather rough
and ready, so I'm not going to post them here. (Changes in the
WaitLatch() API also caused bitrot, which is now fixed.)

Changes from V2:

* Since Heikki eliminated the need for any extra memtuple "slots"
(memtuples is now only exactly big enough for the initial merge heap),
an awful lot of code could be thrown out that managed sizing memtuples
in the context of the parallel leader (based on trends seen in
parallel workers). I was able to follow Heikki's example by
eliminating code for parallel sorting memtuples sizing. Throwing this
code out let me streamline a lot of stuff within tuplesort.c, which is
cleaned up quite a bit.

* Since this revision was mostly focused on fixing up logtape.c
(rebasing on top of Heikki's work), I also took the time to clarify
some things about how a block-based offset might need to be applied
within the leader. Essentially, outlining how and where that happens,
and where it doesn't and shouldn't happen. (An offset must sometimes
be applied to compensate for differences in logical BufFile positioning
(leader/worker differences) following the leader's unification of worker
tapesets into one big tapeset of its own.)

* max_parallel_workers_maintenance now supersedes the use of the new
parallel_workers index storage parameter. This matches existing heap
storage parameter behavior, and allows the implementation to add
practically no cycles as compared to master branch when the use of
parallelism is disabled by setting max_parallel_workers_maintenance to
0.

* New additions to the chapter in the documentation that Robert added
a little while back, "Chapter 15. Parallel Query". It's perhaps a bit
of a stretch to call this feature part of parallel query, but I think
that it works reasonably well. The optimizer does determine the number
of workers needed here, so while it doesn't formally produce a query
plan, I think the implication that it does is acceptable for
user-facing documentation. (Actually, it would be nice if you really
could EXPLAIN utility commands -- that would be a handy place to show
information about how they were executed.)

Maybe this new documentation describes things in what some would
consider to be excessive detail for users. The relatively detailed
information added on parallel sorting seemed to be in the pragmatic
spirit of the new chapter 15, so I thought I'd see what people
thought.

Work is still needed on:

* Cost model. Should probably attempt to guess final index size, and
derive calculation of number of workers from that. Also, I'm concerned
that I haven't given enough thought to the low end, where with default
settings most CREATE INDEX statements will use at least one parallel
worker.

* The whole way that I teach nbtsort.c to disallow catalog tables for
parallel CREATE INDEX due to concerns about parallel safety is in need
of expert review, preferably from Robert. It's complicated in a way
that relies on things happening or not happening from a distance.

* Heikki seems to want to change more about logtape.c, and its use of
indirection blocks. That may evolve, but for now I can only target the
master branch.

* More extensive performance testing. I think that this V3 is probably
the fastest version yet, what with Heikki's improvements, but I
haven't really verified that.

--
Peter Geoghegan

Attachment

Re: Parallel tuplesort (for parallel B-Tree index creation)

From
Robert Haas
Date:
On Fri, Oct 7, 2016 at 5:47 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Work is still needed on:
>
> * Cost model. Should probably attempt to guess final index size, and
> derive calculation of number of workers from that. Also, I'm concerned
> that I haven't given enough thought to the low end, where with default
> settings most CREATE INDEX statements will use at least one parallel
> worker.
>
> * The whole way that I teach nbtsort.c to disallow catalog tables for
> parallel CREATE INDEX due to concerns about parallel safety is in need
> of expert review, preferably from Robert. It's complicated in a way
> that relies on things happening or not happening from a distance.
>
> * Heikki seems to want to change more about logtape.c, and its use of
> indirection blocks. That may evolve, but for now I can only target the
> master branch.
>
> * More extensive performance testing. I think that this V3 is probably
> the fastest version yet, what with Heikki's improvements, but I
> haven't really verified that.

I realize that you are primarily targeting utility commands here, and
that is obviously great, because making index builds faster is very
desirable. However, I'd just like to talk for a minute about how this
relates to parallel query.  With Rushabh's Gather Merge patch, you can
now have a plan that looks like Gather Merge -> Sort -> whatever.
That patch also allows other patterns that are useful completely
independently of this patch, like Finalize GroupAggregate -> Gather
Merge -> Partial GroupAggregate -> Sort -> whatever, but the Gather
Merge -> Sort -> whatever path is very related to what this patch
does.  For example, instead of committing this patch at all, we could
try to funnel index creation through the executor, building a plan of
that shape, and using the results to populate the index.  I'm not
saying that's a good idea, but it could be done.

On the flip side, what if anything can queries hope to get out of
parallel sort that they can't get out of Gather Merge?  One
possibility is that a parallel sort might end up being substantially
faster than Gather Merge-over-non-parallel sort.  In that case, we
obviously should prefer it.  Other possibilities seem a little
obscure.  For example, it's possible that you might want to have all
workers participate in sorting some data and then divide the result of
the sort into equal ranges that are again divided among the workers,
or that you might want all workers to sort and then each worker to
read a complete copy of the output data.  But these don't seem like
particularly mainstream needs, nor do they necessarily seem like
problems that parallel sort itself should be trying to solve.  The
Volcano paper[1], one of the oldest and most-cited sources I can find
for research into parallel execution and with a design fairly similar
to our own executor, describes various variants of what they call
Exchange, of which what we now call Gather is one.  They describe
another variant called Interchange[2], which acts like a Gather node
without terminating parallelism: every worker process reads the
complete output of an Interchange, which is the union of all rows
produced by all workers running the Interchange's input plan.  That
seems like a better design than coupling such data flows specifically
to parallel sort.

I'd like to think that parallel sort will help lots of queries, as
well as helping utility commands, but I'm not sure it will.  Thoughts?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

[1] "Volcano - an Extensible and Parallel Query Evaluation System",
https://pdfs.semanticscholar.org/865b/5f228f08ebac0b68d3a4bfd97929ee85e4b6.pdf
[2] See "C. Variants of the Exchange Operator" on p. 13 of [1]



Re: Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Wed, Oct 12, 2016 at 11:09 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> I realize that you are primarily targeting utility commands here, and
> that is obviously great, because making index builds faster is very
> desirable. However, I'd just like to talk for a minute about how this
> relates to parallel query.  With Rushabh's Gather Merge patch, you can
> now have a plan that looks like Gather Merge -> Sort -> whatever.
> That patch also allows other patterns that are useful completely
> independently of this patch, like Finalize GroupAggregate -> Gather
> Merge -> Partial GroupAggregate -> Sort -> whatever, but the Gather
> Merge -> Sort -> whatever path is very related to what this patch
> does.  For example, instead of committing this patch at all, we could
> try to funnel index creation through the executor, building a plan of
> that shape, and using the results to populate the index.  I'm not
> saying that's a good idea, but it could be done.

Right, but that would be essentially the same approach as mine, only
less efficient and more complicated, I suspect. More importantly, it
wouldn't be partitioning, and partitioning is what we really need
within the executor.

> On the flip side, what if anything can queries hope to get out of
> parallel sort that they can't get out of Gather Merge?  One
> possibility is that a parallel sort might end up being substantially
> faster than Gather Merge-over-non-parallel sort.  In that case, we
> obviously should prefer it.

I must admit that I don't know enough about it to comment just yet.
Offhand, it occurs to me that the Gather Merge sorted input could come
from a number of different types of paths/nodes, whereas adopting what
I've done here could only work more or less equivalently to "Gather
Merge -> Sort -> Seq Scan" -- a special case, really.

> For example, it's possible that you might want to have all
> workers participate in sorting some data and then divide the result of
> the sort into equal ranges that are again divided among the workers,
> or that you might want all workers to sort and then each worker to
> read a complete copy of the output data.  But these don't seem like
> particularly mainstream needs, nor do they necessarily seem like
> problems that parallel sort itself should be trying to solve.

This project of mine is about parallelizing tuplesort.c, which isn't
really what you want for parallel query -- you shouldn't try to scope
the problem as "make the sort more scalable using parallelism" there.
Rather, you want to scope it at "make the execution of the entire
query more scalable using parallelism", which is really quite a
different thing, which necessarily involves the executor having direct
knowledge of partition boundaries. Maybe the executor enlists
tuplesort.c to help with those boundaries to some degree, but that
whole thing is basically something which treats tuplesort.c as a low
level primitive.

> The
> Volcano paper[1], one of the oldest and most-cited sources I can find
> for research into parallel execution and with a design fairly similar
> to our own executor, describes various variants of what they call
> Exchange, of which what we now call Gather is one.

I greatly respect the work of Goetz Graefe, including his work on the
Volcano paper. Graefe has been the single biggest external influence on
my work on Postgres.

> They describe
> another variant called Interchange, which acts like a Gather node
> without terminating parallelism: every worker process reads the
> complete output of an Interchange, which is the union of all rows
> produced by all workers running the Interchange's input plan.  That
> seems like a better design than coupling such data flows specifically
> to parallel sort.
>
> I'd like to think that parallel sort will help lots of queries, as
> well as helping utility commands, but I'm not sure it will.  Thoughts?

You are right that I'm targeting the cases where we can get real
benefits without really changing the tuplesort.h contract too much.
This is literally the parallel tuplesort.c patch, which probably isn't
very useful for parallel query, because the final output is always
consumed serially here (this doesn't matter all that much for CREATE
INDEX, I believe). This approach of mine seems like the simplest way
of getting a large benefit to users from parallelizing sorting,
but I certainly don't imagine it to be the be-all and end-all.

I have at least tried to anticipate how tuplesort.c will eventually
serve the needs of partitioning for the benefit of parallel query. My
intuition is that you'll have to teach it about partitioning
boundaries fairly directly -- it won't do to add something generic to
the executor. And, it probably won't be the only thing that needs to
be taught about them.

-- 
Peter Geoghegan



Re: Parallel tuplesort (for parallel B-Tree index creation)

From
Amit Kapila
Date:
On Thu, Oct 13, 2016 at 12:35 AM, Peter Geoghegan <pg@heroku.com> wrote:
> On Wed, Oct 12, 2016 at 11:09 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
>> On the flip side, what if anything can queries hope to get out of
>> parallel sort that they can't get out of Gather Merge?  One
>> possibility is that a parallel sort might end up being substantially
>> faster than Gather Merge-over-non-parallel sort.  In that case, we
>> obviously should prefer it.
>
> I must admit that I don't know enough about it to comment just yet.
> Offhand, it occurs to me that the Gather Merge sorted input could come
> from a number of different types of paths/nodes, whereas adopting what
> I've done here could only work more or less equivalently to "Gather
> Merge -> Sort -> Seq Scan" -- a special case, really.
>
>> For example, it's possible that you might want to have all
>> workers participate in sorting some data and then divide the result of
>> the sort into equal ranges that are again divided among the workers,
>> or that you might want all workers to sort and then each worker to
>> read a complete copy of the output data.  But these don't seem like
>> particularly mainstream needs, nor do they necessarily seem like
>> problems that parallel sort itself should be trying to solve.
>
> This project of mine is about parallelizing tuplesort.c, which isn't
> really what you want for parallel query -- you shouldn't try to scope
> the problem as "make the sort more scalable using parallelism" there.
> Rather, you want to scope it at "make the execution of the entire
> query more scalable using parallelism", which is really quite a
> different thing, which necessarily involves the executor having direct
> knowledge of partition boundaries.
>

Okay, but what is the proof, or why do you think the second is going to
be better than the first?  One thing which strikes me as a major
difference between your approach and Gather Merge is that in your
approach the leader has to wait till all the workers are done with
their work on sorting, whereas with Gather Merge, as soon as the first
one is done, the leader starts with merging.  I could be wrong here,
but if I understood it correctly, then there is an argument that a
Gather Merge kind of approach can win in cases where some of the
workers can produce sorted outputs ahead of others, and I am not sure
if we can dismiss such cases.

+struct Sharedsort
+{
..
+ * Workers increment workersFinished to indicate having finished.  If
+ * this is equal to state.launched within the leader, leader is ready
+ * to merge runs.
+ *
+ * leaderDone indicates if leader is completely done (i.e., was
+ * tuplesort_end called against the state through which parallel output
+ * was consumed?)
+ */
+ int currentWorker;
+ int workersFinished;
..
}

By looking at the 'workersFinished' usage, it looks like you have
devised a new way for the leader to know when workers have finished,
which might be required for this patch.  However, have you investigated
whether existing infrastructure that serves the same purpose could be
used for it?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Mon, Oct 17, 2016 at 5:36 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Okay, but what is the proof or why do you think second is going to
> better than first?

I don't have proof. It's my opinion that it probably would be, based
on partial information, and my intuition. It's hard to prove something
like that, because it's not really clear what that alternative would
look like. Also, finding all of that out would take a long time --
it's hard to prototype. Do tuple table slots need to care about
IndexTuples now? What does that even look like? What existing executor
code needs to be taught about this new requirement?

> One thing which strikes as a major difference
> between your approach and Gather Merge is that in your approach leader
> has to wait till all the workers have done with their work on sorting
> whereas with Gather Merge as soon as first one is done, leader starts
> with merging.  I could be wrong here, but if I understood it
> correctly, then there is a argument that Gather Merge kind of approach
> can win in cases where some of the workers can produce sorted outputs
> ahead of others and I am not sure if we can dismiss such cases.

How can it? You need to have at least one tuple from every worker
(before the worker has exhausted its supply of output tuples) in order
for the merge to return the next tuple to the top-level consumer (the
thing above the Gather Merge). If you're talking about "eager vs. lazy
merging", please see my previous remarks on that, on this thread. (In
any case, whether we merge more eagerly seems like an orthogonal
question to the one you ask.)

The first thing to note about my approach is that I openly acknowledge
that this parallel CREATE INDEX patch is not much use for parallel
query. I have only generalized tuplesort.c to support parallelizing a
sort operation. I think that parallel query needs partitioning to push
down parts of a sort to workers, with little or no need for them to be
funneled together at the end, since most tuples are eliminated before
being passed to the Gather/Gather Merge node. The partitioning part is
really hard.

I guess that Gather Merge nodes have value because they allow us to
preserve the sorted-ness of a parallel path, which might be most
useful because it enables things elsewhere. But, that doesn't really
recommend making Gather Merge nodes good at batch processing a large
number of tuples, I suspect. (One problem with the tuple queue
mechanism is that it can be a big bottleneck -- that's why we want to
eliminate most tuples before they're passed up to the leader, in the
case of parallel sequential scan in 9.6.)

I read the following paragraph from the Volcano paper just now:

"""
During implementation and benchmarking of parallel sorting, we added
two more features to exchange. First, we wanted to implement a merge
network in which some processors produce sorted streams merge
concurrently by other processors. Volcano’s sort iterator can be used
to generate a sorted stream. A merge iterator was easily derived from
the sort module. It uses a single level merge, instead of the cascaded
merge of runs used in sort. The input of a merge iterator is an
exchange. Differently from other operators, the merge iterator
requires to distinguish the input records by their producer. As an
example, for a join operation it does not matter where the input
records were created, and all inputs can be accumulated in a single
input stream. For a merge operation, it is crucial to distinguish the
input records by their producer in order to merge multiple sorted
streams correctly.
"""

I don't really understand this paragraph, but thought I'd ask: why the
need to "distinguish the input records by their producer in order to
merge multiple sorted streams correctly"? Isn't that talking about
partitioning, where each worker's *ownership* of a range matters? My
patch doesn't care which values belong to which workers. And, it
focuses quite a lot on dealing well with the memory bandwidth bound,
I/O bound part of the sort where we write out the index itself, just
by piggy-backing on tuplesort.c. I don't think that that's useful for
a general-purpose executor node -- tuple-at-a-time processing when
fetching from workers would kill performance.

> By looking at 'workersFinished' usage, it looks like you have devised
> a new way for leader to know when workers have finished which might be
> required for this patch.  However, have you tried to use or
> investigate if existing infrastructure which serves same purpose could
> be used for it?

Yes, I have. I think that Robert's "condition variables" patch would
offer a general solution to what I've devised.

What I have there is, as you say, fairly ad-hoc, even though my
requirements are actually fairly general. I was actually annoyed that
there wasn't an easier way to do that myself. Robert has said that he
won't commit his "condition variables" work until it's clear that
there will be some use for the facility. Well, I'd use it for this
patch, if I could. Robert?
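
To sketch what I mean (using the condition variable API from Robert's
patch as I understand it; the CV and mutex fields shown are things I'd
add to Sharedsort, and the wait event argument is a placeholder):

/* leader waits for every worker's run to become available */
for (;;)
{
    int         finished;

    SpinLockAcquire(&shared->mutex);
    finished = shared->workersFinished;
    SpinLockRelease(&shared->mutex);

    if (finished == nlaunched)
        break;                  /* all runs available; begin merging */

    ConditionVariableSleep(&shared->workersFinishedCv, 0);
}
ConditionVariableCancelSleep();

Workers would increment workersFinished under the same mutex, then
ConditionVariableBroadcast() on the CV.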

--
Peter Geoghegan



Re: Parallel tuplesort (for parallel B-Tree index creation)

From
Robert Haas
Date:
On Mon, Oct 17, 2016 at 8:36 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> This project of mine is about parallelizing tuplesort.c, which isn't
>> really what you want for parallel query -- you shouldn't try to scope
>> the problem as "make the sort more scalable using parallelism" there.
>> Rather, you want to scope it at "make the execution of the entire
>> query more scalable using parallelism", which is really quite a
>> different thing, which necessarily involves the executor having direct
>> knowledge of partition boundaries.
>
> Okay, but what is the proof or why do you think second is going to
> better than first?  One thing which strikes as a major difference
> between your approach and Gather Merge is that in your approach leader
> has to wait till all the workers have done with their work on sorting
> whereas with Gather Merge as soon as first one is done, leader starts
> with merging.  I could be wrong here, but if I understood it
> correctly, then there is a argument that Gather Merge kind of approach
> can win in cases where some of the workers can produce sorted outputs
> ahead of others and I am not sure if we can dismiss such cases.

Gather Merge can't emit a tuple unless it has buffered at least one
tuple from every producer; otherwise, the next tuple it receives from
one of those producers might precede whichever tuple it chooses to
emit.  However, it doesn't need to wait until all of the workers are
completely done.  The leader only needs to be at least slightly ahead
of the slowest worker.  I'm not sure how that compares to Peter's
approach.

What I'm worried about is that we're implementing two separate systems
to do the same thing, and that the parallel sort approach is actually
a lot less general.  I think it's possible to imagine a Parallel Sort
implementation which does things Gather Merge can't.  If all of the
workers collaborate to sort all of the data rather than each worker
sorting its own data, then you've got something which Gather Merge
can't match.  But this is not that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Wed, Oct 19, 2016 at 7:39 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> Gather Merge can't emit a tuple unless it has buffered at least one
> tuple from every producer; otherwise, the next tuple it receives from
> one of those producers might precede whichever tuple it chooses to
> emit.  However, it doesn't need to wait until all of the workers are
> completely done.  The leader only needs to be at least slightly ahead
> of the slowest worker.  I'm not sure how that compares to Peter's
> approach.

I don't think that eager merging will prove all that effective,
however it's implemented. I see a very I/O bound system when parallel
CREATE INDEX merges serially. There is no obvious reason why you'd
have a straggler worker process with CREATE INDEX, really.

> What I'm worried about is that we're implementing two separate systems
> to do the same thing, and that the parallel sort approach is actually
> a lot less general.  I think it's possible to imagine a Parallel Sort
> implementation which does things Gather Merge can't.  If all of the
> workers collaborate to sort all of the data rather than each worker
> sorting its own data, then you've got something which Gather Merge
> can't match.  But this is not that.

It's not that yet, certainly. I think I've sketched a path forward for
making partitioning a part of logtape.c that is promising. The sharing
of ranges within tapes and so on will probably have a significant
amount in common with what I've come up with.

I don't think that any executor infrastructure is a particularly good
model when *batch output* is needed -- the tuple queue mechanism will
be a significant bottleneck, particularly because it does not
integrate read-ahead, etc. The best case that I saw advertised for
Gather Merge was TPC-H query 9 [1]. That doesn't look like a good
proxy for how Gather Merge adapted to parallel CREATE INDEX would do,
since it benefits from the GroupAggregate merge having many equal
values, possibly with a clustering in the original tables that can
naturally be exploited (no TID tiebreaker needed, since IndexTuples
are not being merged). Also, it looks like Gather Merge may do well
there by enabling things, rather than by parallelizing the sort
effectively per se. Besides, the query 9 case is significantly less
scalable than good cases for this parallel CREATE INDEX patch have
already been shown to be.

I think I've been pretty modest about what this parallel CREATE INDEX
patch gets us from the beginning. It is a generalization of
tuplesort.c to work in parallel; we need a lot more for that to make
things like GroupAggregate as scalable as possible, and I don't
pretend that this helps much with that. There are actually more
changes to nbtsort.c to coordinate all of this than there are to
tuplesort.c in the latest version, so I think that this simpler
approach for parallel CREATE INDEX and CLUSTER is worthwhile.

The bottom line is that it's inherently difficult for me to refute the
idea that Gather Merge could do just as well as what I have here,
because proving that involves adding a significant amount of new
infrastructure (e.g., to teach the executor about IndexTuples). I
think that the argument for this basic approach is sound (it appears
to offer comparable scalability to the parallel CREATE INDEX
implementations of other systems), but it's simply impractical for me
to offer much reassurance beyond that.

[1] https://github.com/tvondra/pg_tpch/blob/master/dss/templates/9.sql
-- 
Peter Geoghegan



Re: Parallel tuplesort (for parallel B-Tree index creation)

From
Amit Kapila
Date:
On Thu, Oct 20, 2016 at 12:03 AM, Peter Geoghegan <pg@heroku.com> wrote:
> On Wed, Oct 19, 2016 at 7:39 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Gather Merge can't emit a tuple unless it has buffered at least one
>> tuple from every producer; otherwise, the next tuple it receives from
>> one of those producers might precede whichever tuple it chooses to
>> emit.

Right. Now, after again looking at Gather Merge patch, I think I can
better understand how it performs merging.

>>  However, it doesn't need to wait until all of the workers are
>> completely done.  The leader only needs to be at least slightly ahead
>> of the slowest worker.  I'm not sure how that compares to Peter's
>> approach.
>
> I don't think that eager merging will prove all that effective,
> however it's implemented. I see a very I/O bound system when parallel
> CREATE INDEX merges serially. There is no obvious reason why you'd
> have a straggler worker process with CREATE INDEX, really.
>
>> What I'm worried about is that we're implementing two separate systems
>> to do the same thing, and that the parallel sort approach is actually
>> a lot less general.  I think it's possible to imagine a Parallel Sort
>> implementation which does things Gather Merge can't.  If all of the
>> workers collaborate to sort all of the data rather than each worker
>> sorting its own data, then you've got something which Gather Merge
>> can't match.  But this is not that.
>
> It's not that yet, certainly. I think I've sketched a path forward for
> making partitioning a part of logtape.c that is promising. The sharing
> of ranges within tapes and so on will probably have a significant
> amount in common with what I've come up with.
>
> I don't think that any executor infrastructure is a particularly good
> model when *batch output* is needed -- the tuple queue mechanism will
> be a significant bottleneck, particularly because it does not
> integrate read-ahead, etc.
>

The tuple queue mechanism might not be super-efficient for *batch
output* (cases where many tuples need to be read and written), but I
see no reason why it will be slower than the disk I/O which I think you
are using in the patch.  IIUC, in the patch each worker, including the
leader, does the tape sort for its share of tuples, and then finally
the leader merges and populates the index.  I am not sure if the
mechanism used in the patch can be useful as compared to using a tuple
queue, if the workers can finish their part of the sorting in-memory.

>
> The bottom line is that it's inherently difficult for me to refute the
> idea that Gather Merge could do just as well as what I have here,
> because proving that involves adding a significant amount of new
> infrastructure (e.g., to teach the executor about IndexTuples).
>

I think there could be a simpler way, like we can force the Gather
Merge node when all the tuples need to be sorted, and compute the time
till it merges all tuples.  Similarly, with your patch, we can wait
till the final merge is completed.  However, after doing an initial
study of both the patches, I feel one can construct cases where Gather
Merge can win, and also there will be cases where your patch can win.
In particular, Gather Merge can win where workers need to perform the
sort mostly in-memory.  I am not sure if it's easy to get the best of
both worlds.

Your patch needs a rebase, and I noticed one warning.
sort\logtape.c(1422): warning C4700: uninitialized local variable 'lt' used

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel tuplesort (for parallel B-Tree index creation)

From
Amit Kapila
Date:
On Tue, Oct 18, 2016 at 3:48 AM, Peter Geoghegan <pg@heroku.com> wrote:
> On Mon, Oct 17, 2016 at 5:36 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> I read the following paragraph from the Volcano paper just now:
>
> """
> During implementation and benchmarking of parallel sorting, we added
> two more features to exchange. First, we wanted to implement a merge
> network in which some processors produce sorted streams merge
> concurrently by other processors. Volcano’s sort iterator can be used
> to generate a sorted stream. A merge iterator was easily derived from
> the sort module. It uses a single level merge, instead of the cascaded
> merge of runs used in sort. The input of a merge iterator is an
> exchange. Differently from other operators, the merge iterator
> requires to distinguish the input records by their producer. As an
> example, for a join operation it does not matter where the input
> records were created, and all inputs can be accumulated in a single
> input stream. For a merge operation, it is crucial to distinguish the
> input records by their producer in order to merge multiple sorted
> streams correctly.
> """
>
> I don't really understand this paragraph, but thought I'd ask: why the
> need to "distinguish the input records by their producer in order to
> merge multiple sorted streams correctly"? Isn't that talking about
> partitioning, where each worker's *ownership* of a range matters?
>

I think so, but it seems from the above text that it is mainly required
for the merge iterator, which will probably be used in a merge join.

> My
> patch doesn't care which values belong to which workers. And, it
> focuses quite a lot on dealing well with the memory bandwidth bound,
> I/O bound part of the sort where we write out the index itself, just
> by piggy-backing on tuplesort.c. I don't think that that's useful for
> a general-purpose executor node -- tuple-at-a-time processing when
> fetching from workers would kill performance.
>

Right, but what is written in the text you quoted seems to be doable
with tuple-at-a-time processing.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel tuplesort (for parallel B-Tree index creation)

From
Amit Kapila
Date:
On Fri, Oct 21, 2016 at 4:25 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Tue, Oct 18, 2016 at 3:48 AM, Peter Geoghegan <pg@heroku.com> wrote:
>> On Mon, Oct 17, 2016 at 5:36 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> I read the following paragraph from the Volcano paper just now:
>>
>> """
>> During implementation and benchmarking of parallel sorting, we added
>> two more features to exchange. First, we wanted to implement a merge
>> network in which some processors produce sorted streams merge
>> concurrently by other processors. Volcano’s sort iterator can be used
>> to generate a sorted stream. A merge iterator was easily derived from
>> the sort module. It uses a single level merge, instead of the cascaded
>> merge of runs used in sort. The input of a merge iterator is an
>> exchange. Differently from other operators, the merge iterator
>> requires to distinguish the input records by their producer. As an
>> example, for a join operation it does not matter where the input
>> records were created, and all inputs can be accumulated in a single
>> input stream. For a merge operation, it is crucial to distinguish the
>> input records by their producer in order to merge multiple sorted
>> streams correctly.
>> """
>>
>> I don't really understand this paragraph, but thought I'd ask: why the
>> need to "distinguish the input records by their producer in order to
>> merge multiple sorted streams correctly"? Isn't that talking about
>> partitioning, where each worker's *ownership* of a range matters?
>>
>
> I think so, but it seems from the above text that it is mainly required
> for the merge iterator, which will probably be used in a merge join.
>
>> My
>> patch doesn't care which values belong to which workers. And, it
>> focuses quite a lot on dealing well with the memory bandwidth bound,
>> I/O bound part of the sort where we write out the index itself, just
>> by piggy-backing on tuplesort.c. I don't think that that's useful for
>> a general-purpose executor node -- tuple-at-a-time processing when
>> fetching from workers would kill performance.
>>
>
> Right, but what is written in the text you quoted seems to be doable
> with tuple-at-a-time processing.
>

To be clear, by saying the above, I don't mean that we should try that
approach instead of what you are proposing, but it is worth some
discussion to see if it has any significant merits.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Fri, Oct 7, 2016 at 5:47 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Work is still needed on:
>
> * Cost model. Should probably attempt to guess final index size, and
> derive calculation of number of workers from that. Also, I'm concerned
> that I haven't given enough thought to the low end, where with default
> settings most CREATE INDEX statements will use at least one parallel
> worker.
>
> * The whole way that I teach nbtsort.c to disallow catalog tables for
> parallel CREATE INDEX due to concerns about parallel safety is in need
> of expert review, preferably from Robert. It's complicated in a way
> that relies on things happening or not happening from a distance.
>
> * Heikki seems to want to change more about logtape.c, and its use of
> indirection blocks. That may evolve, but for now I can only target the
> master branch.
>
> * More extensive performance testing. I think that this V3 is probably
> the fastest version yet, what with Heikki's improvements, but I
> haven't really verified that.

While I haven't made progress on any of these open items, I should
still get a version out that applies cleanly on top of git tip --
commit b75f467b6eec0678452fd8d7f8d306e6df3a1076 caused the patch to
bitrot. I attach V4, which is a fairly mechanical rebase of V3, with
no notable behavioral changes or bug fixes.

--
Peter Geoghegan

Attachment

Re: Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Mon, Aug 1, 2016 at 3:18 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Setup:
>
> CREATE TABLE parallel_sort_test AS
>     SELECT hashint8(i) randint,
>     md5(i::text) collate "C" padding1,
>     md5(i::text || '2') collate "C" padding2
>     FROM generate_series(0, 1e9::bigint) i;
>
> CHECKPOINT;
>
> This leaves us with a parallel_sort_test table that is 94 GB in size.
>
> SET maintenance_work_mem = '8GB';
>
> -- Serial case (external sort, should closely match master branch):
> CREATE INDEX serial_idx ON parallel_sort_test (randint) WITH
> (parallel_workers = 0);
>
> Total time: 00:15:42.15
>
> -- Patch with 8 tuplesort "sort-and-scan" workers (leader process
> participates as a worker here):
> CREATE INDEX patch_8_idx ON parallel_sort_test (randint) WITH
> (parallel_workers = 7);
>
> Total time: 00:06:03.86
>
> As you can see, the parallel case is 2.58x faster

I decided to revisit this exact benchmark, using the same AWS instance
type (the one with 16 HDDs, again configured in software RAID0) to see
how things had changed for both parallel and serial cases. I am now
testing V4. A lot changed in the last 3 months, with most of the
changes that help here now already committed to the master branch.

Relevant changes
================

* Heikki's major overhaul of preload memory makes CREATE INDEX merging
have more sequential access patterns. It also effectively allows us to
use more memory. It's possible that the biggest benefit it brings to
parallel CREATE INDEX is that it eliminates almost any random I/O
penalty from logtape.c fragmentation that an extra merge pass has;
parallel workers now usually do their own merge to produce one big run
for the leader to merge. It also improves CPU cache efficiency quite
directly, I think.

This is the patch that helps most. Many thanks to Heikki for driving
this forward.

* My patch to simplify and optimize how the K-way merge heap is
maintained (as tuples fill leaf pages of the final index structure)
makes the merge phase significantly less CPU bound overall.

(These first two items particularly help parallel CREATE INDEX, which
spends proportionally much more wall clock time merging than would be
expected for similar serial cases. Serial cases do of course also
benefit.)

* V2 of the patch (and all subsequent versions) apportioned slices of
maintenance_work_mem to workers. maintenance_work_mem became a
per-utility-operation budget, regardless of number of workers
launched. This means that workers have less memory than the original
V1 benchmark (they simply don't make use of it now), but this seems
unlikely to hurt. Possibly, it even helps.

* Andres' work on md.c scalability may have helped (seems unlikely
with these CREATE INDEX cases that produce indexes not in the hundreds
of gigabytes, though). It would help with *extremely* large index
creation, which we won't really look at here.

Things now look better than ever for the parallel CREATE INDEX patch.
While it's typical for about 75% of wall clock time to be spent on
sorting runs with serial CREATE INDEX, with the remaining 25% going on
merging/writing index, with parallel CREATE INDEX I now generally see
about a 50/50 split between parallel sorting of runs (including any
worker merging to produce final runs) and serial merging for final
on-the-fly merge where we actually write new index out as input is
merged. This is a *significant* improvement over what we saw here back
in August, where it was not uncommon for parallel CREATE INDEX to
spend *twice* as much time in the serial final on-the-fly merge step.

All improvements to the code that we've seen since August have
targeted this final on-the-fly merge bottleneck. (The final on-the-fly
merge is now *consistently* able to write out the index at a rate of
150MB/sec+ in my tests, which is pretty good.)

New results
==========

Same setup as one quoted above -- once again, we "SET
maintenance_work_mem = '8GB'".

-- Patch with 8 tuplesort "sort-and-scan" workers:
CREATE INDEX patch_8_idx ON parallel_sort_test (randint) WITH
(parallel_workers = 7);

Total time: 00:04:24.93

-- Serial case:
CREATE INDEX serial_idx ON parallel_sort_test (randint) WITH
(parallel_workers = 0);

Total time: 00:14:25.19

3.27x faster. Not bad. As you see in the quoted text, that was 2.58x
back in August, even though the implementation now uses a lot less
memory in parallel workers. And, that's without even considering the
general question of how much faster index creation can be compared to
Postgres 9.6 -- it's roughly 3.5x faster at times.

New case
========

Separately, using my gensort tool [1], I came up with a new test case.
The tool generated a 2.5 billion row table, sized at 159GB. This is
how long it takes to produce a 73GB index on the "sortkey" column of
the resulting table:

-- gensort "C" locale text parallel case:
CREATE INDEX test8 on sort_test(sortkey) WITH (parallel_workers = 7);

Total time: 00:16:19.63

-- gensort "C" locale text serial case:
CREATE INDEX test0 on sort_test(sortkey) WITH (parallel_workers = 0);

Total time: 00:45:56.96

That's a 2.81x improvement in creation time relative to the serial
case. Not quite as big a difference as in the first case, but remember
that cases just like this one were only made something like 2x - 2.2x
faster by the use of parallelism back in August (see the full e-mail
quoted above [2]). These are cases involving a text column, or maybe a
numeric column, with complex comparators that must handle detoasting
during merging, possibly even allocate memory, etc. This
second result is therefore probably the more significant of the two
results shown, since it now seems like we're more consistently close
to the ~3x improvement that other major database systems also seem to
top out at as parallel CREATE INDEX workers are added. (I still can't
see any benefit with 16 workers; my guess is that the anti-scaling
begins even before the merge starts when using that many workers. That
guess is hard to verify, given the confounding factor of more workers
producing more runs, leaving more work for the serial merge phase.)

I'd still welcome benchmarking or performance validation from somebody else.

[1] https://github.com/petergeoghegan/gensort
[2] https://www.postgresql.org/message-id/CAM3SWZQKM=Pzc=CAHzRixKjp2eO5Q0Jg1SoFQqeXFQ647JiwqQ@mail.gmail.com
-- 
Peter Geoghegan



Re: Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Wed, Oct 19, 2016 at 11:33 AM, Peter Geoghegan <pg@heroku.com> wrote:
> I don't think that eager merging will prove all that effective,
> however it's implemented. I see a very I/O bound system when parallel
> CREATE INDEX merges serially. There is no obvious reason why you'd
> have a straggler worker process with CREATE INDEX, really.

In an effort to head off any misunderstanding around this patch
series, I started a new Wiki page for it:

https://wiki.postgresql.org/wiki/Parallel_External_Sort

This talks about parallel CREATE INDEX in particular, and uses of
parallel external sort more generally, including future uses beyond
CREATE INDEX.

This approach worked very well for me during the UPSERT project, where
a detailed overview really helped. With UPSERT, it was particularly
difficult to keep the *current* state of things straight, such as
current open items for the patch, areas of disagreement, and areas
where there was no longer any disagreement or controversy. I don't
think that this patch is even remotely as complicated as UPSERT was,
but it's still something that has had several concurrently active
mailing list threads (threads that are at least loosely related to the
project), so I think that this will be useful.

I welcome anyone with an interest in this project to review the Wiki
page, add their own concerns to it with -hackers citation, and add
their own content around related work. There is a kind of unresolved
question around where the Gather Merge work might fit into what I've
come up with already. There may be other unresolved questions like
that which I'm not even aware of.

I commit to maintaining the new Wiki page as a useful starting
reference for understanding the current state of this patch. I hope
this makes looking into the patch series less intimidating for
potential reviewers.

-- 
Peter Geoghegan



Re: Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Mon, Oct 24, 2016 at 6:17 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> * Cost model. Should probably attempt to guess final index size, and
>> derive calculation of number of workers from that. Also, I'm concerned
>> that I haven't given enough thought to the low end, where with default
>> settings most CREATE INDEX statements will use at least one parallel
>> worker.

> While I haven't made progress on any of these open items, I should
> still get a version out that applies cleanly on top of git tip --
> commit b75f467b6eec0678452fd8d7f8d306e6df3a1076 caused the patch to
> bitrot. I attach V4, which is a fairly mechanical rebase of V3, with
> no notable behavioral changes or bug fixes.

I attach V5. Changes:

* A big cost model overhaul. Workers are logarithmically scaled based
on projected final *index* size, not current heap size, as was the
case in V4. A new nbtpage.c routine is added to estimate the size of a
not-yet-built B-Tree index, now called by the optimizer. This involves
getting the average item width for indexed attributes from
pg_attribute for the heap relation. There are some subtleties here
with partial indexes, null_frac, etc. I also refined the cap on the
number of workers, which prevents launching too many of them when
there isn't much maintenance_work_mem to go around.

The cost model is much improved now -- it is now more than just a
placeholder, at least. It doesn't do things like launch a totally
inappropriate number of workers to build a very small partial index.
Granted, those workers would still have something to do -- scan the
heap -- but not enough to justify launching so many (that is,
launching as many as would be launched for an equivalent non-partial
index).

That having been said, things are still quite fudged here, and I
struggle to find any guiding principle around doing better on average.
I think that that's because of the inherent difficulty of modeling
what's going on, but I'd be happy to be proven wrong on that. In any
case, I think it's going to be fairly common for DBAs to want to use
the storage parameter to force the use of a particular number of
parallel workers.
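
For example (relation and column names here are only placeholders):

-- bypasses the cost model, forcing 4 parallel workers (the leader
-- still participates as a worker on top of that):
CREATE INDEX idx ON tbl (col) WITH (parallel_workers = 4);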

(See also: my remarks below on how the new bt_estimated_nblocks()
SQL-callable function can give insight into the new cost model's
decisions.)

* Overhauled leader_mergeruns() further, to make it closer to
mergeruns(). We now always rewind input tapes. This simplification
involved refining some of the assertions within logtape.c, which is
also slightly simplified.

* 2 new testing tools are added in the final commit of the patch
series (not actually proposed for commit). These take the form of 2
new SQL-callable functions added to contrib/pageinspect:

bt_estimated_nblocks
--------------------

bt_estimated_nblocks() provides an easy way to see the optimizer's
projection of how large the final index will be. It returns an
estimate in blocks. Example:

mgd=# analyze;
ANALYZE
mgd=# select oid::regclass as rel,
       bt_estimated_nblocks(oid),
       relpages,
       to_char(bt_estimated_nblocks(oid)::numeric / relpages,
               'FM990.990') as estimate_actual
      from pg_class
      where relkind = 'i'
      order by relpages desc limit 20;

                        rel                         │ bt_estimated_nblocks │ relpages │ estimate_actual
────────────────────────────────────────────────────┼──────────────────────┼──────────┼─────────────────
 mgd.acc_accession_idx_accid                        │              107,091 │  106,274 │ 1.008
 mgd.acc_accession_0                                │              169,024 │  106,274 │ 1.590
 mgd.acc_accession_1                                │              169,024 │   80,382 │ 2.103
 mgd.acc_accession_idx_prefixpart                   │               76,661 │   80,382 │ 0.954
 mgd.acc_accession_idx_mgitype_key                  │               76,661 │   76,928 │ 0.997
 mgd.acc_accession_idx_clustered                    │               76,661 │   76,928 │ 0.997
 mgd.acc_accession_idx_createdby_key                │               76,661 │   76,928 │ 0.997
 mgd.acc_accession_idx_numericpart                  │               76,661 │   76,928 │ 0.997
 mgd.acc_accession_idx_logicaldb_key                │               76,661 │   76,928 │ 0.997
 mgd.acc_accession_idx_modifiedby_key               │               76,661 │   76,928 │ 0.997
 mgd.acc_accession_pkey                             │               76,661 │   76,928 │ 0.997
 mgd.mgi_relationship_property_idx_propertyname_key │               74,197 │   74,462 │ 0.996
 mgd.mgi_relationship_property_idx_modifiedby_key   │               74,197 │   74,462 │ 0.996
 mgd.mgi_relationship_property_pkey                 │               74,197 │   74,462 │ 0.996
 mgd.mgi_relationship_property_idx_clustered        │               74,197 │   74,462 │ 0.996
 mgd.mgi_relationship_property_idx_createdby_key    │               74,197 │   74,462 │ 0.996
 mgd.seq_sequence_idx_clustered                     │               50,051 │   50,486 │ 0.991
 mgd.seq_sequence_raw_pkey                          │               35,826 │   35,952 │ 0.996
 mgd.seq_sequence_raw_idx_modifiedby_key            │               35,826 │   35,952 │ 0.996
 mgd.seq_source_assoc_idx_clustered                 │               35,822 │   35,952 │ 0.996
(20 rows)

I haven't tried to make the underlying logic as close to perfect as
possible, but it tends to be accurate in practice, as is evident from
this real-world example (which shows the larger indexes following a
restoration of the mouse genome sample database [1]). Perhaps there
could be a role for a refined bt_estimated_nblocks() function in
determining when B-Tree indexes become bloated/unbalanced (maybe have
pgstatindex() estimate index bloat based on the difference between
projected and actual fan-in?). That has nothing to do with parallel
CREATE INDEX, though.
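
To make that bloat idea concrete, something along these lines could
flag suspicious indexes (just a sketch built on the testing function;
the 20% threshold is entirely arbitrary):

-- indexes whose actual size exceeds the projection by more than 20%
select oid::regclass as rel,
       relpages,
       bt_estimated_nblocks(oid) as estimated
from pg_class
where relkind = 'i'
  and relpages > bt_estimated_nblocks(oid) * 1.2
order by relpages - bt_estimated_nblocks(oid) desc;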

bt_main_forks_identical
-----------------------

bt_main_forks_identical() checks whether 2 relations have bitwise
identical main forks. If they do, it returns the number of blocks in
the main fork of each. Otherwise, an error is raised.

Unlike any approach involving *writing* the index in parallel (e.g.,
any worthwhile approach based on data partitioning), the proposed
parallel CREATE INDEX implementation creates an index representation
identical to that created by any serial process (including, for
example, the master branch when CREATE INDEX uses an internal sort).
The index that you end up with when parallelism is used ought to be
100% identical in all cases.

(This is true because there is a TID tie-breaker when sorting B-Tree
index tuples, and because LSNs are set to 0 by CREATE INDEX. Why not
exploit that fact to test the implementation?)

If anyone can demonstrate that parallel CREATE INDEX creates an index
representation that is not bitwise-identical to that of a "known good"
serial implementation, or can demonstrate that it doesn't consistently
produce exactly the same final index representation given the same
underlying table as input, then there *must* be a bug.
bt_main_forks_identical() gives reviewers an easy way to verify this,
perhaps just in passing during benchmarking.
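
For example, a reviewer might do something like the following while
benchmarking (a sketch; I'm assuming the function simply takes the two
index relations as arguments):

-- build the same index with and without parallelism
CREATE INDEX serial_idx ON parallel_sort_test (randint) WITH
(parallel_workers = 0);
CREATE INDEX patch_8_idx ON parallel_sort_test (randint) WITH
(parallel_workers = 7);
-- raises an error unless the main forks are bitwise identical;
-- otherwise, returns the number of blocks in each main fork
SELECT bt_main_forks_identical('serial_idx', 'patch_8_idx');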

pg_restore
==========

It occurs to me that parallel CREATE INDEX receives no special
consideration by pg_restore. This leaves it so that the use of
parallel CREATE INDEX can come down to whether or not
pg_class.reltuples is accidentally updated by something like an
initial CREATE INDEX. This is not ideal. There is also the question
of how pg_restore -j cases ought to give special consideration to
parallel CREATE INDEX, if at all -- it's probably true that concurrent
index builds on the same relation do go together well with parallel
CREATE INDEX, but even in V5 pg_restore remains totally naive about
this.

That having been said, pg_restore currently does nothing clever with
maintenance_work_mem when multiple jobs are used, even though that
seems at least as useful as what I outline for parallel CREATE INDEX.
It's not clear how to judge this.

What do we need to teach pg_restore about parallel CREATE INDEX, if
anything at all? Could this be as simple as a blanket disabling of
parallelism for CREATE INDEX from pg_restore? Or, does it need to be
more sophisticated than that? I suppose that tools like reindexdb and
pgbench must be considered in a similar way.

Maybe we could get the number of blocks in the heap relation when its
pg_class.reltuples is 0, from the smgr, and then extrapolate the
reltuples using simple, generic logic, in the style of
vac_estimate_reltuples() (its "old_rel_pages" == 0 case). For now,
I've avoided doing that out of concern for the overhead in cases where
there are many small tables to be restored, and because it may be
better to err on the side of not using parallelism.
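
Expressed in SQL purely for illustration, the extrapolation amounts to
something like this (the usable-space fraction and tuple width are
made-up placeholders here; the real thing would use generic
server-side assumptions, as vac_estimate_reltuples() does, and would
get the block count from the smgr):

select (pg_relation_size('some_table') / 8192)  -- heap blocks (8KB pages assumed)
       * (8192 * 0.9 / 100)::int                -- ~90% usable space, 100-byte tuples (assumed)
       as extrapolated_reltuples;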

[1] https://wiki.postgresql.org/wiki/Sample_Databases
--
Peter Geoghegan

Attachment

Re: Parallel tuplesort (for parallel B-Tree index creation)

From
Robert Haas
Date:
On Mon, Nov 7, 2016 at 11:28 PM, Peter Geoghegan <pg@heroku.com> wrote:
> I attach V5.

I gather that 0001, which puts a cap on the number of tapes, is not
actually related to the subject of this thread; it's an independent
change that you think is a good idea.  I reviewed the previous
discussion on this topic upthread, between you and Heikki, which seems
to me to contain more heat than light.  At least in my opinion, the
question is not whether a limit on the number of tapes is the best
possible system, but rather whether it's better than the status quo.
It's silly to refuse to make a simple change on the grounds that some
much more complex change might be better, because if somebody writes
that patch and it is better we can always revert 0001 then.  If 0001
involved hundreds of lines of invasive code changes, that argument
wouldn't apply, but it doesn't; it's almost a one-liner.

Now, on the other hand, as far as I can see, the actual amount of
evidence that 0001 is a good idea which has been presented in this
forum is pretty near zero.  You've argued for it on theoretical
grounds several times, but theoretical arguments are not a substitute
for test results.  Therefore, I decided that the best thing to do was
test it myself.  I wrote a little patch to add a GUC for
max_sort_tapes, which actually turns out not to work as I thought:
setting max_sort_tapes = 501 seems to limit the highest tape number to
501 rather than the number of tapes to 501, so there's a sort of
off-by-one error.  But that doesn't really matter.  The patch is
attached here for the convenience of anyone else who may want to
fiddle with this.

Next, I tried to set things up so that I'd get a large enough number
of tapes for the cap to matter.  To do that, I initialized with
"pgbench -i --unlogged-tables -s 20000" so that I had 2 billion
tuples.  Then I used this SQL query: "select sum(w+abalance) from
(select (aid::numeric * 7123000217)%1000000000 w, * from
pgbench_accounts order by 1) x".  The point of the math is to perturb
the ordering of the tuples so that they actually need to be sorted
instead of just passed through unchanged. The use of abalance in the
outer sum prevents an index-only-scan from being used, which makes the
sort wider; perhaps I should have tried to make it wider still, but
this is what I did.  I wanted to have more than 501 tapes because,
obviously, a concern with a change like this is that things might get
slower in the case where it forces a polyphase merge rather than a
single merge pass. And, of course, I set trace_sort = on.
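
Putting the pieces together, the test looks like this (work_mem of
256MB is inferred from the "workMem = 262144" trace_sort lines below;
max_sort_tapes comes from the GUC patch attached here):

$ pgbench -i --unlogged-tables -s 20000

SET work_mem = '256MB';
SET trace_sort = on;
-- second run only: SET max_sort_tapes = 501;
select sum(w+abalance)
from (select (aid::numeric * 7123000217)%1000000000 w, *
      from pgbench_accounts order by 1) x;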

Here's what my initial run looked like, in brief:

2016-11-09 15:37:52 UTC [44026] LOG:  begin tuple sort: nkeys = 1,
workMem = 262144, randomAccess = f
2016-11-09 15:37:59 UTC [44026] LOG:  switching to external sort with
937 tapes: CPU: user: 5.51 s, system: 0.27 s, elapsed: 6.56 s
2016-11-09 16:48:31 UTC [44026] LOG:  finished writing run 616 to tape
615: CPU: user: 4029.17 s, system: 152.72 s, elapsed: 4238.54 s
2016-11-09 16:48:31 UTC [44026] LOG:  using 246719 KB of memory for
read buffers among 616 input tapes
2016-11-09 16:48:39 UTC [44026] LOG:  performsort done (except 616-way
final merge): CPU: user: 4030.30 s, system: 152.98 s, elapsed: 4247.41
s
2016-11-09 18:33:30 UTC [44026] LOG:  external sort ended, 6255145
disk blocks used: CPU: user: 10214.64 s, system: 175.24 s, elapsed:
10538.06 s

And according to psql: Time: 10538068.225 ms (02:55:38.068)

Then I set max_sort_tapes = 501 and ran it again.  This time:

2016-11-09 19:05:22 UTC [44026] LOG:  begin tuple sort: nkeys = 1,
workMem = 262144, randomAccess = f
2016-11-09 19:05:28 UTC [44026] LOG:  switching to external sort with
502 tapes: CPU: user: 5.69 s, system: 0.26 s, elapsed: 6.13 s
2016-11-09 20:15:20 UTC [44026] LOG:  finished writing run 577 to tape
75: CPU: user: 3993.81 s, system: 153.42 s, elapsed: 4198.52 s
2016-11-09 20:15:20 UTC [44026] LOG:  using 249594 KB of memory for
read buffers among 501 input tapes
2016-11-09 20:21:19 UTC [44026] LOG:  finished 77-way merge step: CPU:
user: 4329.50 s, system: 160.67 s, elapsed: 4557.22 s
2016-11-09 20:21:19 UTC [44026] LOG:  performsort done (except 501-way
final merge): CPU: user: 4329.50 s, system: 160.67 s, elapsed: 4557.22
s
2016-11-09 21:38:12 UTC [44026] LOG:  external sort ended, 6255484
disk blocks used: CPU: user: 8848.81 s, system: 182.64 s, elapsed:
9170.62 s

And this one, according to psql: Time: 9170629.597 ms (02:32:50.630)

That looks very good.  On a test that runs for almost 3 hours, we
saved more than 20 minutes.  The overall runtime improvement is 23% in
a case where we would not expect this patch to do particularly well;
after all, without limiting the number of runs, we are able to
complete the sort with a single merge pass, whereas when we reduce the
number of runs, we now require a polyphase merge.  Nevertheless, we
come out way ahead, because the final merge pass gets way faster,
presumably because there are fewer tapes involved.  The first test
does a 616-way final merge and takes 6184.34 seconds to do it.  The
second test does a 501-way final merge and takes 4519.31 seconds to
do it.  This increased final merge speed accounts for practically all
of the speedup, and the reason it's faster pretty much has to be that
the speedup, and the reason it's faster pretty much has to be that
it's merging fewer tapes.

That, in turn, happens for two reasons.  First, because limiting the
number of tapes increases slightly the memory available for storing
the tuples belonging to each run, we end up with fewer runs in the
first place.  The number of runs drops from 616 to 577, about a
7% reduction.  Second, because we have more runs than tapes in the
second case, it does a 77-way merge prior to the final merge.  Because
of that 77-way merge, the time at which the second test starts
producing tuples is slightly later.  Instead of producing the first
tuple at 70:47.71, we have to wait until 75:72.22.  That's a small
disadvantage in this case, because it's hypothetically possible that a
query like this could have a LIMIT and we'd end up worse off overall.
However, that's pretty unlikely, for three reasons.  Number one, LIMIT
isn't likely to be used on queries of this type in the first place.
Number two, if it were used, we'd probably end up with a bounded sort
plan which would be way faster anyway.  Number three, if somehow we
still sorted the data set we'd still win in this case if the limit
were more than about 20% of the total number of tuples.  The much
faster run time to produce the whole data set is a small price to pay
for possibly needing to wait a little longer for the first tuple.

Admittedly, this is only one test, and some other test might show a
different result.  However, I believe that there aren't likely to be
many losing cases.  If the increased number of tapes doesn't force a
polyphase merge, we're almost certain to win, because in that case the
only thing that changes is that we have more memory with which to
produce each run.  On small sorts, this may not help much, but it
won't hurt.  Even if the increased number of tapes *does* force a
polyphase merge, the reduction in the number of initial runs and/or
the reduction in the number of runs in any single merge may add up to
a win, as in this example.  In fact, it may well be the case that the
optimal number of tapes is significantly less than 501.  It's hard to
tell for sure, but it sure looks like that 77-way non-final merge is
significantly more efficient than the final merge.

So, I'm now feeling pretty bullish about this patch, except for one
thing, which is that I think the comments are way off-base. Peter
writes: $$When allowedMem is significantly lower than what is required
for an internal sort, it is unlikely that there are benefits to
increasing the number of tapes beyond Knuth's "sweet spot" of 7.$$
I'm pretty sure that's totally wrong, first of all because commit
df700e6b40195d28dc764e0c694ac8cef90d4638 improved performance by doing
precisely the thing which this comment says we shouldn't, secondly
because 501 is most definitely significantly higher than 7 so the code
and the comment don't even match, and thirdly because, as the comment
added in the commit says, each extra tape doesn't really cost that
much.  In this example, going from 501 tapes up to 937 tapes only
reduces memory available for tuples by about 7%, even though the
number of tapes has almost doubled.  If we had a sort with, say, 30
runs, do we really want to do a polyphase merge just to get a sub-1%
increase in the amount of memory per run?  I doubt it.

Given all that, what I'm inclined to do is rewrite the comment to say,
basically, that even though we can afford lots of tapes, it's better
not to allow too ridiculously many because (1) that eats away at the
amount of memory available for tuples in each initial run and (2) very
high-order final merges are not very efficient.  And then commit that.
If somebody wants to fine-tune the tape limit later after more
extensive testing or replacing it by some other system that is better,
great.

Sound OK?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Wed, Nov 9, 2016 at 4:01 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I gather that 0001, which puts a cap on the number of tapes, is not
> actually related to the subject of this thread; it's an independent
> change that you think is a good idea.  I reviewed the previous
> discussion on this topic upthread, between you and Heikki, which seems
> to me to contain more heat than light.

FWIW, I don't remember it that way. Heikki seemed to be uncomfortable
with the quasi-arbitrary choice of constant, rather than disagreeing
with the general idea of a cap. Or, maybe he thought I didn't go far
enough -- that I should have removed polyphase merge completely. I
think that removing polyphase merge would be an orthogonal change to
this, though.

> Now, on the other hand, as far as I can see, the actual amount of
> evidence that 0001 is a good idea which has been presented in this
> forum is pretty near zero.  You've argued for it on theoretical
> grounds several times, but theoretical arguments are not a substitute
> for test results.

See the illustration in TAOCP, vol III, page 273 in the second edition
-- "Fig. 70. Efficiency of Polyphase merge using Algorithm D". I think
that it's actually a real-world benchmark.

I guess I felt that no one ever argued that as many tapes as possible
was sound on any grounds, even theoretical, and so didn't feel
obligated to test it until asked to do so. I think that the reason
that a cap like this didn't go in around the time that the growth
logic went in (2006) was because nobody followed up on it. If you look
at the archives, there is plenty of discussion of a cap like this at
the time.

> That looks very good.  On a test that runs for almost 3 hours, we
> saved more than 20 minutes.  The overall runtime improvement is 23% in
> a case where we would not expect this patch to do particularly well;
> after all, without limiting the number of runs, we are able to
> complete the sort with a single merge pass, whereas when we reduce the
> number of runs, we now require a polyphase merge.  Nevertheless, we
> come out way ahead, because the final merge pass gets way faster,
> presumably because there are fewer tapes involved.  The first test
> does a 616-way final merge and takes 6184.34 seconds to do it.  The
> second test does a 501-way final merge and takes 4519.31 seconds to
> do it.  This increased final merge speed accounts for practically all
> of the speedup, and the reason it's faster pretty much has to be that
> the speedup, and the reason it's faster pretty much has to be that
> it's merging fewer tapes.

It's CPU cache efficiency -- has to be.

> That, in turn, happens for two reasons.  First, because limiting the
> number of tapes increases slightly the memory available for storing
> the tuples belonging to each run, we end up with fewer runs in the
> first place.  The number of runs drops from 616 to 577, about a
> 7% reduction.  Second, because we have more runs than tapes in the
> second case, it does a 77-way merge prior to the final merge.  Because
> of that 77-way merge, the time at which the second test starts
> producing tuples is slightly later.  Instead of producing the first
> tuple at 70:47.71, we have to wait until 75:72.22.  That's a small
> disadvantage in this case, because it's hypothetically possible that a
> query like this could have a LIMIT and we'd end up worse off overall.
> However, that's pretty unlikely, for three reasons.  Number one, LIMIT
> isn't likely to be used on queries of this type in the first place.
> Number two, if it were used, we'd probably end up with a bounded sort
> plan which would be way faster anyway.  Number three, if somehow we
> still sorted the data set we'd still win in this case if the limit
> were more than about 20% of the total number of tuples.  The much
> faster run time to produce the whole data set is a small price to pay
> for possibly needing to wait a little longer for the first tuple.

Cool.

> So, I'm now feeling pretty bullish about this patch, except for one
> thing, which is that I think the comments are way off-base. Peter
> writes: $When allowedMem is significantly lower than what is required
> for an internal sort, it is unlikely that there are benefits to
> increasing the number of tapes beyond Knuth's "sweet spot" of 7.$
> I'm pretty sure that's totally wrong, first of all because commit
> df700e6b40195d28dc764e0c694ac8cef90d4638 improved performance by doing
> precisely the thing which this comment says we shouldn't

It's more complicated than that. As I said, I think that Knuth
basically had it right with his sweet spot of 7. I think that commit
df700e6b40195d28dc764e0c694ac8cef90d4638 was effective in large part
because a one-pass merge avoided certain overheads not inherent to
polyphase merge, like all that memory accounting stuff, extra palloc()
traffic, etc. The expanded use of per tape buffering we have even in
multi-pass cases likely makes that much less true for us these days.

The reason I haven't actually gone right back down to 7 with this cap
is that it's possible that the added I/O costs outweigh the CPU costs
in extreme cases, even though I think that polyphase merge doesn't
have all that much to do with I/O costs, even with its 1970s
perspective. Knuth doesn't say much about I/O costs -- it's more about
using an extremely small amount of memory effectively (minimizing CPU
costs with very little available main memory).

Furthermore, not limiting ourselves to 7 tapes and seeing a benefit
(benefitting from a few dozen or hundred instead) seems more possible
with the improved merge heap maintenance logic added recently, where
there could be perhaps hundreds of runs merged with very low CPU cost
in the event of presorted input (or, input that is inversely
logically/physically correlated). That would be true because we'd
only ever examine the top of the heap throughout, and so I/O costs
may matter much more.

Depending on the exact details, I bet you could see a benefit with
only 7 tapes due to CPU cache efficiency in a case like the one you
describe. Perhaps when sorting integers, but not when sorting collated
text. There are many competing considerations, which I've tried my
best to balance here with a merge order of 500.

> Sound OK?

I'm fine with not mentioning Knuth's sweet spot once more. I guess
it's not of much practical value that he was on to something with
that. I realize, on reflection, that my understanding of what's going
on is very nuanced.

Thanks
-- 
Peter Geoghegan



Re: Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Wed, Nov 9, 2016 at 4:54 PM, Peter Geoghegan <pg@heroku.com> wrote:
> It's more complicated than that. As I said, I think that Knuth
> basically had it right with his sweet spot of 7. I think that commit
> df700e6b40195d28dc764e0c694ac8cef90d4638 was effective in large part
> because a one-pass merge avoided certain overheads not inherent to
> polyphase merge, like all that memory accounting stuff, extra palloc()
> traffic, etc. The expanded use of per tape buffering we have even in
> multi-pass cases likely makes that much less true for us these days.

Also, logtape.c fragmentation made multiple merge pass cases
experience increased random I/O in a way that was only an accident of
our implementation. We've fixed that now, but that problem must have
added further cost that df700e6b40195d28dc764e0c694ac8cef90d4638
*masked* when it was committed in 2006. (I do think that the problem
with the merge heap maintenance fixed recently in
24598337c8d214ba8dcf354130b72c49636bba69 was the biggest problem that
the 2006 work masked, though).

-- 
Peter Geoghegan



Re: Parallel tuplesort (for parallel B-Tree index creation)

From
Robert Haas
Date:
On Wed, Nov 9, 2016 at 7:54 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> Now, on the other hand, as far as I can see, the actual amount of
>> evidence that 0001 is a good idea which has been presented in this
>> forum is pretty near zero.  You've argued for it on theoretical
>> grounds several times, but theoretical arguments are not a substitute
>> for test results.
>
> See the illustration in TAOCP, vol III, page 273 in the second edition
> -- "Fig. 70. Efficiency of Polyphase merge using Algorithm D". I think
> that it's actually a real-world benchmark.

I don't have that publication, and I'm guessing that's not based on
PostgreSQL's implementation.  There's no substitute for tests using
the code we've actually got.

>> So, I'm now feeling pretty bullish about this patch, except for one
>> thing, which is that I think the comments are way off-base. Peter
>> writes: $When allowedMem is significantly lower than what is required
>> for an internal sort, it is unlikely that there are benefits to
>> increasing the number of tapes beyond Knuth's "sweet spot" of 7.$
>> I'm pretty sure that's totally wrong, first of all because commit
>> df700e6b40195d28dc764e0c694ac8cef90d4638 improved performance by doing
>> precisely the thing which this comment says we shouldn't
>
> It's more complicated than that. As I said, I think that Knuth
> basically had it right with his sweet spot of 7. I think that commit
> df700e6b40195d28dc764e0c694ac8cef90d4638 was effective in large part
> because a one-pass merge avoided certain overheads not inherent to
> polyphase merge, like all that memory accounting stuff, extra palloc()
> traffic, etc. The expanded use of per tape buffering we have even in
> multi-pass cases likely makes that much less true for us these days.
>
> The reason I haven't actually gone right back down to 7 with this cap
> is that it's possible that the added I/O costs outweigh the CPU costs
> in extreme cases, even though I think that polyphase merge doesn't
> have all that much to do with I/O costs, even with its 1970s
> perspective. Knuth doesn't say much about I/O costs -- it's more about
> using an extremely small amount of memory effectively (minimizing CPU
> costs with very little available main memory).
>
> Furthermore, not limiting ourselves to 7 tapes and seeing a benefit
> (benefitting from a few dozen or hundred instead) seems more possible
> with the improved merge heap maintenance logic added recently, where
> there could be perhaps hundreds of runs merged with very low CPU cost
> in the event of presorted input (or, input that is inversely
> logically/physically correlated). That would be true because we'd
> only ever examine the top of the heap throughout, and so I/O costs
> may matter much more.
>
> Depending on the exact details, I bet you could see a benefit with
> only 7 tapes due to CPU cache efficiency in a case like the one you
> describe. Perhaps when sorting integers, but not when sorting collated
> text. There are many competing considerations, which I've tried my
> best to balance here with a merge order of 500.

I guess that's possible, but the problem with polyphase merge is that
the increased I/O becomes a pretty significant cost in a hurry.
Here's the same test with max_sort_tapes = 100:

2016-11-09 23:02:49 UTC [48551] LOG:  begin tuple sort: nkeys = 1,
workMem = 262144, randomAccess = f
2016-11-09 23:02:55 UTC [48551] LOG:  switching to external sort with
101 tapes: CPU: user: 5.72 s, system: 0.25 s, elapsed: 6.04 s
2016-11-10 00:13:00 UTC [48551] LOG:  finished writing run 544 to tape
49: CPU: user: 4003.00 s, system: 156.89 s, elapsed: 4211.33 s
2016-11-10 00:16:52 UTC [48551] LOG:  finished 51-way merge step: CPU:
user: 4214.84 s, system: 161.94 s, elapsed: 4442.98 s
2016-11-10 00:25:41 UTC [48551] LOG:  finished 100-way merge step:
CPU: user: 4704.14 s, system: 170.83 s, elapsed: 4972.47 s
2016-11-10 00:36:47 UTC [48551] LOG:  finished 99-way merge step: CPU:
user: 5333.12 s, system: 179.94 s, elapsed: 5638.52 s
2016-11-10 00:45:32 UTC [48551] LOG:  finished 99-way merge step: CPU:
user: 5821.13 s, system: 189.00 s, elapsed: 6163.53 s
2016-11-10 01:01:29 UTC [48551] LOG:  finished 100-way merge step:
CPU: user: 6691.10 s, system: 210.60 s, elapsed: 7120.58 s
2016-11-10 01:01:29 UTC [48551] LOG:  performsort done (except 100-way
final merge): CPU: user: 6691.10 s, system: 210.60 s, elapsed: 7120.58
s
2016-11-10 01:45:40 UTC [48551] LOG:  external sort ended, 6255949
disk blocks used: CPU: user: 9271.07 s, system: 232.26 s, elapsed:
9771.49 s

This is already worse than max_sort_tapes = 501, though the total
runtime is still better than no cap (the time-to-first-tuple is way
worse, though).   I'm going to try max_sort_tapes = 10 next, but I
think the basic pattern is already fairly clear.  As you reduce the
cap on the number of tapes, (a) the time to build the initial runs
doesn't change very much, (b) the time to perform the final merge
decreases significantly, and (c) the time to perform the non-final
merges increases even faster.  In this particular test configuration
on this particular hardware, rewriting 77 tapes in the 501-tape
configuration wasn't too bad, but now that we're down to 100 tapes, we
have to rewrite 449 tapes out of a total of 544, and that's actually a
loss: rewriting the bulk of your data an extra time to save on cache
misses doesn't pay.  It would probably be even less good if there were
other concurrent activity on the system.  It's possible that if your
polyphase merge is actually being done all in memory, cache efficiency
might remain the dominant consideration, but I think we should assume
that a polyphase merge is doing actual I/O, because it's sort of
pointless to use that algorithm in the first place if there's no real
I/O involved.

At the moment, at least, it looks to me as though we don't need to be
afraid of a *little* bit of polyphase merging, but a *lot* of
polyphase merging is actually pretty bad.  In other words, by imposing
a limit on the number of tapes, we're going to improve sorts that are
smaller than work_mem * num_tapes * ~1.5 -- because cache efficiency
will be better -- but above that things will probably get worse
because of the increased I/O cost.  From that point of view, a
500-tape limit is the same as saying that we don't think it's
entirely reasonable to try to perform a sort that exceeds work_mem by
a factor of more than ~750, whereas a 7-tape limit is the same as
saying that we don't think it's entirely reasonable to perform a sort
that exceeds work_mem by a factor of more than ~10.  That latter
proposition seems entirely untenable.  Our default work_mem setting is
4MB, and people will certainly expect to be able to get away with,
say, an 80MB sort without changing settings.  On the other hand, if
they're sorting more than 3GB with work_mem = 4MB, I think we'll be
justified in making a gentle suggestion that they reconsider that
setting.  Among other arguments, it's going to be pretty slow in that
case no matter what we do here.
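
To make the arithmetic concrete (merge order of ~500 times the ~1.5
fudge factor):

select pg_size_pretty((4::bigint * 1024 * 1024) * 500 * 3 / 2);
-- => 3000 MB: with the default work_mem of 4MB, sorts beyond roughly
-- this size are the ones the 500-tape cap can push into polyphase merge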

Maybe another way of putting this is that, while there's clearly a
benefit to having some kind of a cap, it's appropriate to pick a large
value, such as 500.  Having no cap at all risks creating many extra
tapes that just waste memory, and also risks an unduly
cache-inefficient final merge.  Reining that in makes sense.
However, we can't rein it in too far or we'll create slow polyphase
merges in cases that are reasonably likely to occur in real life.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Wed, Nov 9, 2016 at 6:57 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I guess that's possible, but the problem with polyphase merge is that
> the increased I/O becomes a pretty significant cost in a hurry.

Not if you have a huge RAID array. :-)

Obviously I'm not seriously suggesting that we revise the cap from 500
to 7. We're only concerned about the constant factors here. There is
clearly a need to make some simplifying assumptions. I think that you
understand this very well, though.

> Maybe another way of putting this is that, while there's clearly a
> benefit to having some kind of a cap, it's appropriate to pick a large
> value, such as 500.  Having no cap at all risks creating many extra
> tapes that just waste memory, and also risks an unduly
> cache-inefficient final merge.  Reining that in makes sense.
> However, we can't rein it in too far or we'll create slow polyphase
> merges in cases that are reasonably likely to occur in real life.

I completely agree with your analysis.

-- 
Peter Geoghegan



Re: Parallel tuplesort (for parallel B-Tree index creation)

From
Robert Haas
Date:
On Wed, Nov 9, 2016 at 10:18 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> Maybe another way of putting this is that, while there's clearly a
>> benefit to having some kind of a cap, it's appropriate to pick a large
>> value, such as 500.  Having no cap at all risks creating many extra
>> tapes that just waste memory, and also risks an unduly
>> cache-inefficient final merge.  Reining that in makes sense.
>> However, we can't rein it in too far or we'll create slow polyphase
>> merges in cases that are reasonably likely to occur in real life.
>
> I completely agree with your analysis.

Cool.  BTW, my run with 10 tapes completed in 10696528.377 ms
(02:58:16.528) - i.e. almost 3 minutes slower than with no tape limit.
Building runs took 4260.16 s, and the final merge pass began at
8239.12 s.  That's certainly better than I expected, and it seems to
show that even if the number of tapes is grossly inadequate for the
number of runs, you can still make up most of the time that you lose
to I/O with improved cache efficiency -- at least under favorable
circumstances.  Of course, on many systems I/O bandwidth will be a
scarce resource, so that argument can be overdone -- and even if not,
the 10-tape version takes FAR longer to deliver the first tuple.

I also tried this out with work_mem = 512MB.  Doubling work_mem
reduces the number of runs enough that we don't get a polyphase merge
in any case.  With no limit on tapes:

2016-11-10 11:24:45 UTC [54042] LOG:  switching to external sort with
1873 tapes: CPU: user: 11.34 s, system: 0.48 s, elapsed: 12.13 s
2016-11-10 12:36:22 UTC [54042] LOG:  finished writing run 308 to tape
307: CPU: user: 4096.63 s, system: 156.88 s, elapsed: 4309.66 s
2016-11-10 12:36:22 UTC [54042] LOG:  using 516563 KB of memory for
read buffers among 308 input tapes
2016-11-10 12:36:30 UTC [54042] LOG:  performsort done (except 308-way
final merge): CPU: user: 4097.75 s, system: 157.24 s, elapsed: 4317.76
s
2016-11-10 13:54:07 UTC [54042] LOG:  external sort ended, 6255577
disk blocks used: CPU: user: 8638.72 s, system: 177.42 s, elapsed:
8974.44 s

With a max_sort_tapes = 501:

2016-11-10 14:23:50 UTC [54042] LOG:  switching to external sort with
502 tapes: CPU: user: 10.99 s, system: 0.54 s, elapsed: 11.57 s
2016-11-10 15:36:47 UTC [54042] LOG:  finished writing run 278 to tape
277: CPU: user: 4190.31 s, system: 155.33 s, elapsed: 4388.86 s
2016-11-10 15:36:47 UTC [54042] LOG:  using 517313 KB of memory for
read buffers among 278 input tapes
2016-11-10 15:36:54 UTC [54042] LOG:  performsort done (except 278-way
final merge): CPU: user: 4191.36 s, system: 155.68 s, elapsed: 4395.66
s
2016-11-10 16:53:39 UTC [54042] LOG:  external sort ended, 6255699
disk blocks used: CPU: user: 8673.07 s, system: 175.93 s, elapsed:
9000.80 s

0.3% slower with the tape limit, but that might be noise.  Even if
not, it seems pretty silly to create 1873 tapes when we only need
~300.

At work_mem = 2GB:

2016-11-10 18:08:00 UTC [54042] LOG:  switching to external sort with
7490 tapes: CPU: user: 44.28 s, system: 1.99 s, elapsed: 46.33 s
2016-11-10 19:23:06 UTC [54042] LOG:  finished writing run 77 to tape
76: CPU: user: 4342.10 s, system: 156.21 s, elapsed: 4551.95 s
2016-11-10 19:23:06 UTC [54042] LOG:  using 2095202 KB of memory for
read buffers among 77 input tapes
2016-11-10 19:23:12 UTC [54042] LOG:  performsort done (except 77-way
final merge): CPU: user: 4343.36 s, system: 157.07 s, elapsed: 4558.79
s
2016-11-10 20:24:24 UTC [54042] LOG:  external sort ended, 6255946
disk blocks used: CPU: user: 7894.71 s, system: 176.36 s, elapsed:
8230.13 s

At work_mem = 2GB, max_sort_tapes = 501:

2016-11-10 21:28:23 UTC [54042] LOG:  switching to external sort with
502 tapes: CPU: user: 44.09 s, system: 1.94 s, elapsed: 46.07 s
2016-11-10 22:42:28 UTC [54042] LOG:  finished writing run 68 to tape
67: CPU: user: 4278.49 s, system: 154.39 s, elapsed: 4490.25 s
2016-11-10 22:42:28 UTC [54042] LOG:  using 2095427 KB of memory for
read buffers among 68 input tapes
2016-11-10 22:42:34 UTC [54042] LOG:  performsort done (except 68-way
final merge): CPU: user: 4279.60 s, system: 155.21 s, elapsed: 4496.83
s
2016-11-10 23:42:10 UTC [54042] LOG:  external sort ended, 6255983
disk blocks used: CPU: user: 7733.98 s, system: 173.85 s, elapsed:
8072.55 s

Roughly 2% faster.  Maybe still noise, but less likely.  7490 tapes
certainly seems over the top.

At work_mem = 8GB:

2016-11-14 19:17:28 UTC [54042] LOG:  switching to external sort with
29960 tapes: CPU: user: 183.80 s, system: 7.71 s, elapsed: 191.61 s
2016-11-14 20:32:02 UTC [54042] LOG:  finished writing run 20 to tape
19: CPU: user: 4431.44 s, system: 176.82 s, elapsed: 4665.16 s
2016-11-14 20:32:02 UTC [54042] LOG:  using 8388083 KB of memory for
read buffers among 20 input tapes
2016-11-14 20:32:26 UTC [54042] LOG:  performsort done (except 20-way
final merge): CPU: user: 4432.99 s, system: 181.29 s, elapsed: 4689.52
s
2016-11-14 21:30:56 UTC [54042] LOG:  external sort ended, 6256003
disk blocks used: CPU: user: 7835.83 s, system: 199.01 s, elapsed:
8199.29 s

At work_mem = 8GB, max_sort_tapes = 501:

2016-11-14 21:52:43 UTC [54042] LOG:  switching to external sort with
502 tapes: CPU: user: 181.08 s, system: 7.66 s, elapsed: 189.05 s
2016-11-14 23:06:06 UTC [54042] LOG:  finished writing run 17 to tape
16: CPU: user: 4381.56 s, system: 161.82 s, elapsed: 4591.63 s
2016-11-14 23:06:06 UTC [54042] LOG:  using 8388158 KB of memory for
read buffers among 17 input tapes
2016-11-14 23:06:36 UTC [54042] LOG:  performsort done (except 17-way
final merge): CPU: user: 4383.45 s, system: 165.32 s, elapsed: 4622.04
s
2016-11-14 23:54:00 UTC [54042] LOG:  external sort ended, 6256002
disk blocks used: CPU: user: 7124.49 s, system: 182.16 s, elapsed:
7466.18 s

Roughly 9% faster.  Building runs seems to be very slowly degrading as
we increase work_mem, but the final merge is speeding up somewhat more
quickly.  Intuitively that makes sense to me: if merging were faster
than quicksorting, we could just merge-sort all the time instead of
using quicksort for internal sorts.  Also, we've got 29960 tapes now,
better than three orders of magnitude more than what we actually need.
At this work_mem setting, 501 tapes is enough to efficiently sort at
least 4TB of data and quite possibly a good bit more.

So, committed 0001, with comment changes along the lines I proposed before.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Mon, Nov 7, 2016 at 8:28 PM, Peter Geoghegan <pg@heroku.com> wrote:
> What do we need to teach pg_restore about parallel CREATE INDEX, if
> anything at all? Could this be as simple as a blanket disabling of
> parallelism for CREATE INDEX from pg_restore? Or, does it need to be
> more sophisticated than that? I suppose that tools like reindexdb and
> pgbench must be considered in a similar way.

I still haven't resolved this question, which seems like the most
important outstanding question, but I attach V6. Changes:

* tuplesort.c was adapted to use the recently committed condition
variables stuff. This made things cleaner. No more ad-hoc WaitLatch()
looping.

* Adapted docs to mention the newly committed max_parallel_workers
GUC in the context of discussing the proposed
max_parallel_workers_maintenance GUC.

* Fixed trivial assertion failure bug that could be tripped when a
conventional sort uses very little memory.

--
Peter Geoghegan

Attachment

Re: Parallel tuplesort (for parallel B-Tree index creation)

From
Alvaro Herrera
Date:
Peter Geoghegan wrote:
> On Mon, Nov 7, 2016 at 8:28 PM, Peter Geoghegan <pg@heroku.com> wrote:
> > What do we need to teach pg_restore about parallel CREATE INDEX, if
> > anything at all? Could this be as simple as a blanket disabling of
> > parallelism for CREATE INDEX from pg_restore? Or, does it need to be
> > more sophisticated than that? I suppose that tools like reindexdb and
> > pgbench must be considered in a similar way.
> 
> I still haven't resolved this question, which seems like the most
> important outstanding question,

I don't think a patch must necessarily consider all possible uses that
the new feature may have.  If we introduce parallel index creation,
that's great; if pg_restore doesn't start using it right away, that's
okay.  You, or somebody else, can still patch it later.  The patch is
still a step forward.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Sat, Dec 3, 2016 at 5:45 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> I don't think a patch must necessarily consider all possible uses that
> the new feature may have.  If we introduce parallel index creation,
> that's great; if pg_restore doesn't start using it right away, that's
> okay.  You, or somebody else, can still patch it later.  The patch is
> still a step forward.

While I agree, right now pg_restore will tend to use or not use
parallelism for CREATE INDEX more or less by accident, based on
whether or not pg_class.reltuples has already been set by something
else (e.g., an earlier CREATE INDEX against the same table in the
restoration). That seems unacceptable. I haven't just suppressed the
use of parallel CREATE INDEX within pg_restore because that would be
taking a position on something I have a hard time defending any
particular position on. And so, I am slightly concerned about the
entire ecosystem of tools that could implicitly use parallel CREATE
INDEX, with undesirable consequences. Especially pg_restore.

It's not so much a hard question as it is an awkward one. I want to
handle any possible objection about there being future compatibility
issues with going one way or the other ("This paints us into a corner
with..."). And, there is no existing, simple way for pg_restore and
other tools to disable the use of parallelism due to the cost model
automatically kicking in, while still allowing the proposed new index
storage parameter ("parallel_workers") to force the use of
parallelism, which seems like something that should happen. (I might
have to add a new GUC like "enable_maintenance_parallelism", since
"max_parallel_workers_maintenance = 0" disables parallelism no matter
how it might be invoked).
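
To spell out the current state of affairs (the GUC and storage
parameter names are as proposed in this thread; table and index names
are placeholders):

-- disables parallel CREATE INDEX no matter how it's invoked, so tools
-- can't use it to merely neutralize the cost model:
SET max_parallel_workers_maintenance = 0;
-- whereas with a nonzero setting, this forces parallelism regardless
-- of what the cost model would have chosen:
CREATE INDEX idx ON tbl (col) WITH (parallel_workers = 2);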

In general, I have a positive outlook on this patch, since it appears
to compete well with similar implementations in other systems
scalability-wise. It does what it's supposed to do.

-- 
Peter Geoghegan



Re: Parallel tuplesort (for parallel B-Tree index creation)

From
Tomas Vondra
Date:
On Sat, 2016-12-03 at 18:37 -0800, Peter Geoghegan wrote:
> On Sat, Dec 3, 2016 at 5:45 PM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
> > I don't think a patch must necessarily consider all possible uses that
> > the new feature may have.  If we introduce parallel index creation,
> > that's great; if pg_restore doesn't start using it right away, that's
> > okay.  You, or somebody else, can still patch it later.  The patch is
> > still a step forward.
> While I agree, right now pg_restore will tend to use or not use
> parallelism for CREATE INDEX more or less by accident, based on
> whether or not pg_class.reltuples has already been set by something
> else (e.g., an earlier CREATE INDEX against the same table in the
> restoration). That seems unacceptable. I haven't just suppressed the
> use of parallel CREATE INDEX within pg_restore because that would be
> taking a position on something I have a hard time defending any
> particular position on. And so, I am slightly concerned about the
> entire ecosystem of tools that could implicitly use parallel CREATE
> INDEX, with undesirable consequences. Especially pg_restore.
> 
> It's not so much a hard question as it is an awkward one. I want to
> handle any possible objection about there being future compatibility
> issues with going one way or the other ("This paints us into a corner
> with..."). And, there is no existing, simple way for pg_restore and
> other tools to disable the use of parallelism due to the cost model
> automatically kicking in, while still allowing the proposed new index
> storage parameter ("parallel_workers") to force the use of
> parallelism, which seems like something that should happen. (I might
> have to add a new GUC like "enable_maintenance_parallelism", since
> "max_parallel_workers_maintenance = 0" disables parallelism no matter
> how it might be invoked).

I do share your concerns about unpredictable behavior - that's
particularly worrying for pg_restore, which may be used for time-
sensitive use cases (DR, migrations between versions), so unpredictable
changes in behavior / duration are unwelcome.

But isn't this more a deficiency in pg_restore than in CREATE INDEX?
The issue seems to be that the reltuples value may or may not get
updated, so maybe forcing ANALYZE (even very low statistics_target
values would do the trick, I think) would be a more appropriate
solution? Or maybe it's time to add at least some rudimentary
statistics into the dumps (the reltuples field seems like a good
candidate).
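
Something along those lines could be quite cheap -- a sketch (the
statistics target and table name are placeholders):

SET default_statistics_target = 1;
ANALYZE some_table;  -- sets reltuples/relpages at minimal cost
RESET default_statistics_target;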

Trying to fix this by adding more GUCs seems a bit strange to me.

> 
> In general, I have a positive outlook on this patch, since it appears
> to compete well with similar implementations in other systems
> scalability-wise. It does what it's supposed to do.
> 

+1 to that

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Sat, Dec 3, 2016 at 7:23 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> I do share your concerns about unpredictable behavior - that's
> particularly worrying for pg_restore, which may be used for time-
> sensitive use cases (DR, migrations between versions), so unpredictable
> changes in behavior / duration are unwelcome.

Right.

> But isn't this more a deficiency in pg_restore than in CREATE INDEX?
> The issue seems to be that the reltuples value may or may not get
> updated, so maybe forcing ANALYZE (even very low statistics_target
> values would do the trick, I think) would be a more appropriate
> solution? Or maybe it's time to add at least some rudimentary
> statistics into the dumps (the reltuples field seems like a good
> candidate).

I think that there are a number of reasonable ways of looking at it. It
might also be worthwhile to have a minimal ANALYZE performed by CREATE
INDEX directly, iff there are no preexisting statistics (there is
definitely going to be something pg_restore-like that we cannot fix --
some ETL tool, for example). Perhaps, as an additional condition to
proceeding with such an ANALYZE, it should also only happen when there
is any chance at all of parallelism being used (but then you get into
having to establish the relation size reliably in the absence of any
pg_class.relpages, which isn't very appealing when there are many tiny
indexes).

In summary, I would really like it if a consensus emerged on how
parallel CREATE INDEX should handle the ecosystem of tools like
pg_restore, reindexdb, and so on. Personally, I'm neutral on which
general approach should be taken. Proposals from other hackers about
what to do here are particularly welcome.

-- 
Peter Geoghegan



Re: Parallel tuplesort (for parallel B-Tree index creation)

From
Haribabu Kommi
Date:


On Mon, Dec 5, 2016 at 7:44 AM, Peter Geoghegan <pg@heroku.com> wrote:
> On Sat, Dec 3, 2016 at 7:23 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> I do share your concerns about unpredictable behavior - that's
>> particularly worrying for pg_restore, which may be used for time-
>> sensitive use cases (DR, migrations between versions), so unpredictable
>> changes in behavior / duration are unwelcome.
>
> Right.
>
>> But isn't this more a deficiency in pg_restore than in CREATE INDEX?
>> The issue seems to be that the reltuples value may or may not get
>> updated, so maybe forcing ANALYZE (even very low statistics_target
>> values would do the trick, I think) would be a more appropriate
>> solution? Or maybe it's time to add at least some rudimentary
>> statistics into the dumps (the reltuples field seems like a good
>> candidate).
>
> I think that there are a number of reasonable ways of looking at it. It
> might also be worthwhile to have a minimal ANALYZE performed by CREATE
> INDEX directly, iff there are no preexisting statistics (there is
> definitely going to be something pg_restore-like that we cannot fix --
> some ETL tool, for example). Perhaps, as an additional condition to
> proceeding with such an ANALYZE, it should also only happen when there
> is any chance at all of parallelism being used (but then you get into
> having to establish the relation size reliably in the absence of any
> pg_class.relpages, which isn't very appealing when there are many tiny
> indexes).
>
> In summary, I would really like it if a consensus emerged on how
> parallel CREATE INDEX should handle the ecosystem of tools like
> pg_restore, reindexdb, and so on. Personally, I'm neutral on which
> general approach should be taken. Proposals from other hackers about
> what to do here are particularly welcome.


Moved to next CF with "needs review" status.


Regards,
Hari Babu
Fujitsu Australia

Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Robert Haas
Date:
On Wed, Sep 21, 2016 at 12:52 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> I find this unification business really complicated. I think it'd be simpler
> to keep the BufFiles and LogicalTapeSets separate, and instead teach
> tuplesort.c how to merge tapes that live on different
> LogicalTapeSets/BufFiles. Or refactor LogicalTapeSet so that a single
> LogicalTapeSet can contain tapes from different underlying BufFiles.
>
> What I have in mind is something like the attached patch. It refactors
> LogicalTapeRead(), LogicalTapeWrite() etc. functions to take a LogicalTape
> as argument, instead of LogicalTapeSet and tape number. LogicalTapeSet
> doesn't have the concept of a tape number anymore, it can contain any number
> of tapes, and you can create more on the fly. With that, it'd be fairly easy
> to make tuplesort.c merge LogicalTapes that came from different tape sets,
> backed by different BufFiles. I think that'd avoid much of the unification
> code.

I just looked at the buffile.c/buffile.h changes in the latest version
of the patch and I agree with this criticism, though maybe not with
the proposed solution.  I actually don't understand what "unification"
is supposed to mean.  The patch really doesn't explain that anywhere
that I can see.  It says stuff like:

+ * Parallel operations can use an interface to unify multiple worker-owned
+ * BufFiles and a leader-owned BufFile within a leader process.  This relies
+ * on various fd.c conventions about the naming of temporary files.

That comment tells you that unification is a thing you can do -- via
an unspecified interface for unspecified reasons using unspecified
conventions -- but it doesn't tell you what the semantics of it are
supposed to be.  For example, if we "unify" several BufFiles, do they
then have a shared seek pointer?  Do the existing contents effectively
get concatenated in an unpredictable order, or are they all expected
to be empty at the time unification happens?  Or something else?  It's
fine to make up new words -- indeed, in some sense that is the essence
of writing any complex problem -- but you have to define them.  As far
as I can tell, the idea is that we're somehow magically concatenating
the BufFiles into one big super-BufFile, but I'm fuzzy on exactly
what's supposed to be going on there.

It's hard to understand how something like this doesn't leak
resources.  Maybe that's been thought about here, but it isn't very
clear to me how it's supposed to work.  In Heikki's proposal, if
process A is trying to read a file owned by process B, and process B
dies and removes the file before process A gets around to reading it,
we have got trouble, especially on Windows which apparently has low
tolerance for such things.  Peter's proposal avoids that - I *think* -
by making the leader responsible for all resource cleanup, but that's
inferior to the design we've used for other sorts of shared resource
cleanup (DSM, DSA, shm_mq, lock groups) where the last process to
detach always takes responsibility.  That avoids assuming that we're
always dealing with a leader-follower situation, it doesn't
categorically require the leader to be the one who creates the shared
resource, and it doesn't require the leader to be the last process to
die.

Imagine a data structure that is stored in dynamic shared memory and
contains space for a filename, a reference count, and a mutex.  Let's
call this thing a SharedTemporaryFile or something like that.  It
offers these APIs:

extern void SharedTemporaryFileInitialize(SharedTemporaryFile *);
extern void SharedTemporaryFileAttach(SharedTemporaryFile *, dsm_segment *seg);
extern void SharedTemporaryFileAssign(SharedTemporaryFile *, char *pathname);
extern File SharedTemporaryFileGetFile(SharedTemporaryFile *);

After setting aside sizeof(SharedTemporaryFile) bytes in your shared
DSM segment, you call SharedTemporaryFileInitialize() to initialize
them.  Then, every process that cares about the file does
SharedTemporaryFileAttach(), which bumps the reference count and sets
an on_dsm_detach hook to decrement the reference count and unlink the
file if the reference count thereby reaches 0.  One of those processes
does SharedTemporaryFileAssign(), which fills in the pathname and
clears FD_TEMPORARY.  Then, any process that has attached can call
SharedTemporaryFileGetFile() to get a File which can then be accessed
normally.  So, the pattern for parallel sort would be:

- Leader sets aside space and calls SharedTemporaryFileInitialize()
and SharedTemporaryFileAttach().
- The cooperating worker calls SharedTemporaryFileAttach() and then
SharedTemporaryFileAssign().
- The leader then calls SharedTemporaryFileGetFile().

Since the leader can attach to the file before the path name is filled
in, there's no window where the file is at risk of being leaked.
Before SharedTemporaryFileAssign(), the worker is solely responsible
for removing the file; after that call, whichever of the leader and
the worker exits last will remove the file.
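
To pin down those semantics, here is a rough sketch of the struct and
the detach hook (only the API names above are proposed; everything
else here is made up, and it assumes postgres.h, storage/dsm.h, and
storage/spin.h):

typedef struct SharedTemporaryFile
{
    slock_t     mutex;                  /* protects the fields below */
    int         refcnt;                 /* number of attached processes */
    char        pathname[MAXPGPATH];    /* empty until Assign() */
} SharedTemporaryFile;

/* on_dsm_detach callback: the last process to detach unlinks the file */
static void
shared_temp_file_detach(dsm_segment *seg, Datum arg)
{
    SharedTemporaryFile *stf = (SharedTemporaryFile *) DatumGetPointer(arg);
    bool        unlink_it;

    SpinLockAcquire(&stf->mutex);
    unlink_it = (--stf->refcnt == 0 && stf->pathname[0] != '\0');
    SpinLockRelease(&stf->mutex);

    if (unlink_it)
        unlink(stf->pathname);
}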

> That leaves one problem, though: reusing space in the final merge phase. If
> the tapes being merged belong to different LogicalTapeSets, and create one
> new tape to hold the result, the new tape cannot easily reuse the space of
> the input tapes because they are on different tape sets.

If the worker is always completely finished with the tape before the
leader touches it, couldn't the leader's LogicalTapeSet just "adopt"
the tape and overwrite it like any other?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Tue, Dec 20, 2016 at 2:53 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> What I have in mind is something like the attached patch. It refactors
>> LogicalTapeRead(), LogicalTapeWrite() etc. functions to take a LogicalTape
>> as argument, instead of LogicalTapeSet and tape number. LogicalTapeSet
>> doesn't have the concept of a tape number anymore, it can contain any number
>> of tapes, and you can create more on the fly. With that, it'd be fairly easy
>> to make tuplesort.c merge LogicalTapes that came from different tape sets,
>> backed by different BufFiles. I think that'd avoid much of the unification
>> code.
>
> I just looked at the buffile.c/buffile.h changes in the latest version
> of the patch and I agree with this criticism, though maybe not with
> the proposed solution.  I actually don't understand what "unification"
> is supposed to mean.  The patch really doesn't explain that anywhere
> that I can see.  It says stuff like:
>
> + * Parallel operations can use an interface to unify multiple worker-owned
> + * BufFiles and a leader-owned BufFile within a leader process.  This relies
> + * on various fd.c conventions about the naming of temporary files.

Without meaning to sound glib, unification is the process by which
parallel CREATE INDEX has the leader read temp files from workers
sufficient to complete its final on-the-fly merge. So, it's
terminology that's a bit like "speculative insertion" was up until
UPSERT was committed: a concept that is somewhat in flux, and
describes a new low-level mechanism built to support a higher level
operation, which must accord with a higher level set of requirements
(so, for speculative insertion, that would be avoiding "unprincipled
deadlocks" and so on). That being the case, maybe "unification" isn't
useful as a precise piece of terminology at this point, but that will
change.

While I'm fairly confident that I basically have the right idea with
this patch, I think that you are better at judging the ins and outs of
resource management than I am, not least because of the experience of
working on parallel query itself. Also, I'm signed up to review
parallel hash join in large part because I think there might be some
convergence concerning the sharing of BufFiles among parallel workers.
I don't think I'm qualified to judge what a general abstraction like
this should look like, but I'm trying to get there.

> That comment tells you that unification is a thing you can do -- via
> an unspecified interface for unspecified reasons using unspecified
> conventions -- but it doesn't tell you what the semantics of it are
> supposed to be.  For example, if we "unify" several BufFiles, do they
> then have a shared seek pointer?

No.

> Do the existing contents effectively
> get concatenated in an unpredictable order, or are they all expected
> to be empty at the time unification happens?  Or something else?

The order is the order in which ordinal identifiers are assigned to
workers within tuplesort.c, which is undefined, with the notable
exception of the leader's own identifier (-1) and area of the unified
BufFile space (this is only relevant in randomAccess cases, where
leader may write stuff out to its own reserved part of the BufFile
space). It only matters that the bit of metadata in shared memory is
in that same order, which it clearly will be. So, it's unpredictable,
but in the same way that ordinal identifiers are assigned in a
not-well-defined order; it doesn't, or at least shouldn't, matter. We
can imagine a case where it does matter, and we probably should, but
that case isn't parallel CREATE INDEX.
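
To make "unpredictable but harmless" concrete: unification just lays
the worker regions end to end, in ordinal order. A simplified sketch
(the real metadata is expressed in blocks and lives in shared memory;
worker_offset and worker_file_size are invented names):

int64       offset = 0;

for (int i = 0; i < nworkers; i++)
{
    worker_offset[i] = offset;          /* start of worker i's region */
    offset += worker_file_size[i];      /* total that worker i flushed */
}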

> It's fine to make up new words -- indeed, in some sense that is the essence
> of writing any complex problem -- but you have to define them.

I invite you to help me define this new word.

> It's hard to understand how something like this doesn't leak
> resources.  Maybe that's been thought about here, but it isn't very
> clear to me how it's supposed to work.

I agree that it would be useful to centrally document what all this
unification stuff is about. Suggestions on where that should live are
welcome.

> In Heikki's proposal, if
> process A is trying to read a file owned by process B, and process B
> dies and removes the file before process A gets around to reading it,
> we have got trouble, especially on Windows which apparently has low
> tolerance for such things.  Peter's proposal avoids that - I *think* -
> by making the leader responsible for all resource cleanup, but that's
> inferior to the design we've used for other sorts of shared resource
> cleanup (DSM, DSA, shm_mq, lock groups) where the last process to
> detach always takes responsibility.

Maybe it's inferior to that, but I think what Heikki proposes is more
or less complementary to what I've proposed, and has nothing to do
with resource management and plenty to do with making the logtape.c
interface look nice, AFAICT. It's also about refactoring/simplifying
logtape.c itself, while we're at it. I believe that Heikki has yet to
comment either way on my approach to resource management, one aspect
of the patch that I was particularly keen on your looking into.

The theory of operation here is that workers own their own BufFiles,
and are responsible for deleting them when they die. The assumption,
rightly or wrongly, is that it's sufficient that workers flush
everything out (write out temp files), and yield control to the
leader, which will open their temp files for the duration of the
leader final on-the-fly merge. The resource manager in the leader
knows it isn't supposed to ever delete worker-owned files (just
close() the FDs), and the leader errors if it cannot find temp files
that match what it expects. If there is an error in the leader, it
shuts down workers, and they clean up, more than likely. If there is
an error in the worker, or if the files cannot be deleted (e.g., if
there is a classic hard crash scenario), we should also be okay,
because nobody will trip up on some old temp file from some worker,
since fd.c has some gumption about what workers need to do (and what
the leader needs to avoid) in the event of a hard crash. I don't see a
risk of file descriptor leaks, which may or may not have been part of
your concern (please clarify).

> That avoids assuming that we're
> always dealing with a leader-follower situation, it doesn't
> categorically require the leader to be the one who creates the shared
> resource, and it doesn't require the leader to be the last process to
> die.

I have an open mind about that, especially given the fact that I hope
to generalize the unification stuff further, but I am not aware of any
reason why that is strictly necessary.

> Imagine a data structure that is stored in dynamic shared memory and
> contains space for a filename, a reference count, and a mutex.  Let's
> call this thing a SharedTemporaryFile or something like that.  It
> offers these APIs:
>
> extern void SharedTemporaryFileInitialize(SharedTemporaryFile *);
> extern void SharedTemporaryFileAttach(SharedTemporaryFile *, dsm_segment *seg);
> extern void SharedTemporaryFileAssign(SharedTemporaryFile *, char *pathname);
> extern File SharedTemporaryFileGetFile(SharedTemporaryFile *);

I'm a little bit tired right now, and I have yet to look at Thomas'
parallel hash join patch in any detail. I'm interested in what you
have to say here, but I think that I need to learn more about its
requirements in order to have an informed opinion.

>> That leaves one problem, though: reusing space in the final merge phase. If
>> the tapes being merged belong to different LogicalTapeSets, and create one
>> new tape to hold the result, the new tape cannot easily reuse the space of
>> the input tapes because they are on different tape sets.
>
> If the worker is always completely finished with the tape before the
> leader touches it, couldn't the leader's LogicalTapeSet just "adopt"
> the tape and overwrite it like any other?

I'll remind you that parallel CREATE INDEX doesn't actually ever need
to be randomAccess, and so we are not actually going to ever need to
do this as things stand. I wrote the code that way in order to not
break the existing interface, which seemed like a blocker to posting
the patch. I am open to the idea of such an "adoption" occurring, even
though it actually wouldn't help any case that exists in the patch as
proposed. I didn't go that far in part because it seemed premature,
given that nobody had looked at my work to date at the time, and given
the fact that there'd be no initial user-visible benefit, and given
how the exact meaning of "unification" was (and is) somewhat in flux.

I see no good reason to not do that, although that might change if I
actually seriously undertook to teach the leader about this kind of
"adoption". I suspect that the interface specification would make for
confusing reading, which isn't terribly appealing, but I'm sure I
could manage to make it work given time.

-- 
Peter Geoghegan



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Robert Haas
Date:
On Tue, Dec 20, 2016 at 8:14 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Without meaning to sound glib, unification is the process by which
> parallel CREATE INDEX has the leader read temp files from workers
> sufficient to complete its final on-the-fly merge.

That's not glib, but you can't in the end define BufFile unification
in terms of what parallel CREATE INDEX needs.  Whatever changes we
make to lower-level abstractions in the service of some higher-level
goal need to be explainable on their own terms.

>> It's fine to make up new words -- indeed, in some sense that is the essence
>> of writing any complex problem -- but you have to define them.
>
> I invite you to help me define this new word.

If at some point I'm able to understand what it means, I'll try to do
that.  I think you're loosely using "unification" to mean combining
stuff from different backends in some way that depends on the
particular context, so that "BufFile unification" can be different
from "LogicalTape unification".  But that's just punting the question
of what each of those things actually are.

> Maybe it's inferior to that, but I think what Heikki proposes is more
> or less complementary to what I've proposed, and has nothing to do
> with resource management and plenty to do with making the logtape.c
> interface look nice, AFAICT. It's also about refactoring/simplifying
> logtape.c itself, while we're at it. I believe that Heikki has yet to
> comment either way on my approach to resource management, one aspect
> of the patch that I was particularly keen on your looking into.

My reading of Heikki's point was that there's not much point in
touching the BufFile level of things if we can do all of the necessary
stuff at the LogicalTape level, and I agree with him about that.  If a
shared BufFile had a shared read-write pointer, that would be a good
justification for having it.  But it seems like unification at the
BufFile level is just concatenation, and that can be done just as well
at the LogicalTape level, so why tinker with BufFile?  As I've said, I
think there's some low-level hacking needed here to make sure files
get removed at the correct time in all cases, but apart from that I
see no good reason to push the concatenation operation all the way
down into BufFile.

> The theory of operation here is that workers own their own BufFiles,
> and are responsible for deleting them when they die. The assumption,
> rightly or wrongly, is that it's sufficient that workers flush
> everything out (write out temp files), and yield control to the
> leader, which will open their temp files for the duration of the
> leader final on-the-fly merge. The resource manager in the leader
> knows it isn't supposed to ever delete worker-owned files (just
> close() the FDs), and the leader errors if it cannot find temp files
> that match what it expects. If there is an error in the leader, it
> shuts down workers, and they clean up, more than likely. If there is
> an error in the worker, or if the files cannot be deleted (e.g., if
> there is a classic hard crash scenario), we should also be okay,
> because nobody will trip up on some old temp file from some worker,
> since fd.c has some gumption about what workers need to do (and what
> the leader needs to avoid) in the event of a hard crash. I don't see a
> risk of file descriptor leaks, which may or may not have been part of
> your concern (please clarify).

I don't think there's any new issue with file descriptor leaks here,
but I think there is a risk of calling unlink() too early or too late
with your design.  My proposal was an effort to nail that down real
tight.

>> If the worker is always completely finished with the tape before the
>> leader touches it, couldn't the leader's LogicalTapeSet just "adopt"
>> the tape and overwrite it like any other?
>
> I'll remind you that parallel CREATE INDEX doesn't actually ever need
> to be randomAccess, and so we are not actually going to ever need to
> do this as things stand. I wrote the code that way in order to not
> break the existing interface, which seemed like a blocker to posting
> the patch. I am open to the idea of such an "adoption" occurring, even
> though it actually wouldn't help any case that exists in the patch as
> proposed. I didn't go that far in part because it seemed premature,
> given that nobody had looked at my work to date at the time, and given
> the fact that there'd be no initial user-visible benefit, and given
> how the exact meaning of "unification" was (and is) somewhat in flux.
>
> I see no good reason to not do that, although that might change if I
> actually seriously undertook to teach the leader about this kind of
> "adoption". I suspect that the interface specification would make for
> confusing reading, which isn't terribly appealing, but I'm sure I
> could manage to make it work given time.

I think the interface is pretty clear: the worker's logical tapes get
incorporated into the leader's LogicalTapeSet as if they'd been there
all along.  After all, by the time this is happening, IIUC (please
confirm), the worker is done with those tapes and will never read or
modify them again.  If that's right, the worker just needs a way to
identify those tapes to the leader, which can then add them to its
LogicalTapeSet.  That's it.  It needs a way to identify them, but I
think that shouldn't be hard; in fact, I think your patch has
something like that already.  And it needs to make sure that the files
get removed at the right time, but I already sketched a solution to
that problem.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Heikki Linnakangas
Date:
On 12/21/2016 12:53 AM, Robert Haas wrote:
>> That leaves one problem, though: reusing space in the final merge phase. If
>> the tapes being merged belong to different LogicalTapeSets, and create one
>> new tape to hold the result, the new tape cannot easily reuse the space of
>> the input tapes because they are on different tape sets.
>
> If the worker is always completely finished with the tape before the
> leader touches it, couldn't the leader's LogicalTapeSet just "adopt"
> the tape and overwrite it like any other?

Currently, the logical tape code assumes that all tapes in a single 
LogicalTapeSet are allocated from the same BufFile. The logical tape's 
on-disk format contains block numbers, to point to the next/prev block 
of the tape [1], and they're assumed to refer to the same file. That 
allows reusing space efficiently during the merge. After you have read 
the first block from tapes A, B and C, you can immediately reuse those 
three blocks for output tape D.
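
Schematically, per merge step (ltsGetFreeBlock()/ltsReleaseBlock() are
logtape.c's real freelist routines; the other names here are invented
for illustration):

for (int i = 0; i < ninputs; i++)
{
    long        blk = tape_head_block(input_tape[i]);   /* invented */

    read_tape_block(lts, blk, buffer[i]);               /* invented */
    ltsReleaseBlock(lts, blk);      /* block goes on the set's freelist */
}
out_blk = ltsGetFreeBlock(lts);     /* output tape D reuses a freed block */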

Now, if you read multiple tapes, from different LogicalTapeSet, hence 
backed by different BufFiles, you cannot reuse the space from those 
different tapes for a single output tape, because the on-disk format 
doesn't allow referring to blocks in other files. You could reuse the 
space of *one* of the input tapes, by placing the output tape in the 
same LogicalTapeSet, but not all of them.

We could enhance that, by using "filename + block number" instead of 
just block number, in the pointers in the logical tapes. Then you could 
spread one logical tape across multiple files. Probably not worth it in 
practice, though.
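
In struct terms, that enhancement would mean widening the block
pointer roughly like this (illustrative only; field names invented):

/* today: neighbour pointers are bare block numbers within one BufFile */
typedef struct TapeBlockLinks
{
    long        prevBlock;          /* previous block of this tape, or -1L */
    long        nextBlock;          /* next block of this tape, or -1L */
} TapeBlockLinks;

/* hypothetical: widen the pointer so a tape can span multiple files */
typedef struct WideTapeBlockPointer
{
    int         fileno;             /* proxy for the underlying BufFile */
    long        blockno;            /* block within that file */
} WideTapeBlockPointer;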


[1] As the code stands, there are no next/prev pointers, but a tree of 
"indirect" blocks. But I'm planning to change that to simpler next/prev 
pointers, in 
https://www.postgresql.org/message-id/flat/55b3b7ae-8dec-b188-b8eb-e07604052351%40iki.fi

- Heikki




Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Robert Haas
Date:
On Wed, Dec 21, 2016 at 7:04 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> If the worker is always completely finished with the tape before the
>> leader touches it, couldn't the leader's LogicalTapeSet just "adopt"
>> the tape and overwrite it like any other?
>
> Currently, the logical tape code assumes that all tapes in a single
> LogicalTapeSet are allocated from the same BufFile. The logical tape's
> on-disk format contains block numbers, to point to the next/prev block of
> the tape [1], and they're assumed to refer to the same file. That allows
> reusing space efficiently during the merge. After you have read the first
> block from tapes A, B and C, you can immediately reuse those three blocks
> for output tape D.

I see.  Hmm.

> Now, if you read multiple tapes, from different LogicalTapeSet, hence backed
> by different BufFiles, you cannot reuse the space from those different tapes
> for a single output tape, because the on-disk format doesn't allow referring
> to blocks in other files. You could reuse the space of *one* of the input
> tapes, by placing the output tape in the same LogicalTapeSet, but not all of
> them.
>
> We could enhance that, by using "filename + block number" instead of just
> block number, in the pointers in the logical tapes. Then you could spread
> one logical tape across multiple files. Probably not worth it in practice,
> though.

OK, so the options as I understand them are:

1. Enhance the logical tape set infrastructure in the manner you
mention, to support filename (or more likely a proxy for filename) +
block number in the logical tape pointers.  Then, tapes can be
transferred from one LogicalTapeSet to another.

2. Enhance the BufFile infrastructure to support some notion of a
shared BufFile so that multiple processes can be reading and writing
blocks in the same BufFile.  Then, extend the logical tape
infrastructure so that we also have the notion of a shared LogicalTape.
This means that things like ltsGetFreeBlock() need to be re-engineered
to handle concurrency with other backends.

3. Just live with the waste of space.

I would guess that (1) is easier than (2).  Also, (2) might provoke
contention while writing tapes that is otherwise completely
unnecessary.  It seems silly to have multiple backends fighting over
the same end-of-file pointer for the same file when they could just
write to different files instead.

Another tangentially-related problem I just realized is that we need
to somehow handle the issues that tqueue.c does when transferring
tuples between backends -- most of the time there's no problem, but if
anonymous record types are involved then tuples require "remapping".
It's probably harder to provoke a failure in the tuplesort case than
with parallel query per se, but it's probably not impossible.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Wed, Dec 21, 2016 at 6:00 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> 3. Just live with the waste of space.

I am loath to create a special case for the parallel interface too,
but I think it's possible that *no* caller will ever actually need to
live with this restriction at any time in the future. I am strongly
convinced that adopting tuplesort.c for parallelism should involve
partitioning [1]. With that approach, even randomAccess callers will
not want to read at random from one big materialized tape, since that's
at odds with the whole point of partitioning, which is to remove any
dependencies between workers quickly and early, so that as much work
as possible is pushed down into workers. If a merge join were
performed in a world where we have this kind of partitioning, we
definitely wouldn't require one big materialized tape that is
accessible within each worker.

What are the chances of any real user actually having to live with the
waste of space at some point in the future?

> Another tangentially-related problem I just realized is that we need
> to somehow handle the issues that tqueue.c does when transferring
> tuples between backends -- most of the time there's no problem, but if
> anonymous record types are involved then tuples require "remapping".
> It's probably harder to provoke a failure in the tuplesort case than
> with parallel query per se, but it's probably not impossible.

Thanks for pointing that out. I'll look into it.

BTW, I discovered a bug when there is very low memory available
within each worker -- tuplesort.c throws an error in workers
immediately. It's just a matter of making sure that they at least have
64KB of workMem, which is a pretty straightforward fix. Obviously it
makes no sense to use so little memory in the first place; this is a
corner case.

[1] https://www.postgresql.org/message-id/CAM3SWZR+ATYAzyMT+hm-Bo=1L1smtJbNDtibwBTKtYqS0dYZVg@mail.gmail.com
-- 
Peter Geoghegan



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Wed, Dec 21, 2016 at 10:21 AM, Peter Geoghegan <pg@heroku.com> wrote:
> On Wed, Dec 21, 2016 at 6:00 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> 3. Just live with the waste of space.
>
> I am loath to create a special case for the parallel interface too,
> but I think it's possible that *no* caller will ever actually need to
> live with this restriction at any time in the future.

I just realized that you were actually talking about the waste of
space in workers here, as opposed to the theoretical waste of space
that would occur in the leader should there ever be a parallel
randomAccess tuplesort caller.

To be clear, I am totally against allowing a waste of logtape.c temp
file space in *workers*, because that implies a cost that will most
certainly be felt by users all the time.

-- 
Peter Geoghegan



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Tue, Dec 20, 2016 at 5:14 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> Imagine a data structure that is stored in dynamic shared memory and
>> contains space for a filename, a reference count, and a mutex.  Let's
>> call this thing a SharedTemporaryFile or something like that.  It
>> offers these APIs:
>>
>> extern void SharedTemporaryFileInitialize(SharedTemporaryFile *);
>> extern void SharedTemporaryFileAttach(SharedTemporaryFile *, dsm_segment *seg);
>> extern void SharedTemporaryFileAssign(SharedTemporaryFile *, char *pathname);
>> extern File SharedTemporaryFileGetFile(SharedTemporaryFile *);
>
> I'm a little bit tired right now, and I have yet to look at Thomas'
> parallel hash join patch in any detail. I'm interested in what you
> have to say here, but I think that I need to learn more about its
> requirements in order to have an informed opinion.

Attached is V7 of the patch. The overall emphasis with this revision
is on bringing clarity to how much can be accomplished using
generalized infrastructure, on explaining the unification mechanism
coherently, and on related issues.

Notable changes
---------------

* Rebased to work with the newly simplified logtape.c representation
(the recent removal of "indirect blocks" by Heikki). Heikki's work was
something that helped with simplifying the whole unification
mechanism, to a significant degree. I think that there was over a 50%
reduction in logtape.c lines of code in this revision.

* randomAccess cases are now able to reclaim disk space from blocks
originally written by workers. This further simplifies logtape.c
changes significantly. I don't think that this is important because
some future randomAccess caller might otherwise have double the
storage overhead for their parallel sort, or even because of the
disproportionate performance penalty such a caller would experience;
rather, it's important because it removes previous special cases (that
were internal to logtape.c).

For example, aside from the fact that worker tapes within a unified
tapeset will often have a non-zero offset, there is no state that
actually remembers that this is a unified tapeset, because that isn't
needed anymore. And, even though we reclaim blocks from workers, we
only have one central chokepoint for applying worker offsets in the
leader (that chokepoint is ltsReadFillBuffer()). Routines tasked with
things like positional seeking for mark/restore for certain tuplesort
clients (which are, in general, poorly tested) now need to have no
knowledge of unification while still working just the same. This is a
consequence of the fact that ltsWriteBlock() callers (and
ltsWriteBlock() itself) never have to think about offsets. I'm pretty
happy about that.

* pg_restore now prevents the planner from deciding that parallelism
should be used, in order to make restoration behavior more consistent
and predictable. Iff a dump being restored happens to have a CREATE
INDEX with the new index storage parameter parallel_workers set, then
pg_restore will use parallel CREATE INDEX. This is accomplished with a
new GUC, enable_parallelddl (since "max_parallel_workers_maintenance =
0" will disable parallel CREATE INDEX across the board, ISTM that a
second new GUC is required). I think that this behavior the right
trade-off for pg_restore goes, although I still don't feel
particularly strongly about it. There is now a concrete proposal on
what to do about pg_restore, if nothing else. To recap, the general
concern address here is that there are typically no ANALYZE stats
available for the planner to decide with when pg_restore runs CREATE
INDEX, although that isn't always true, which was both surprising and
inconsistent.

* Addresses the problem of anonymous record types and their need for
"remapping" across parallel workers. I've simply pushed the
responsibility on callers within tuplesort.h contract; parallel CREATE
INDEX callers don't need to care about this, as explained there.
(CLUSTER tuplesorts would also be safe.)

* Puts the whole rationale for unification into one large comment
above the function BufFileUnify(), and removes traces of the same kind
of discussion from everywhere else. I think that buffile.c is the
right central place to discuss the unification mechanism, now that
logtape.c has been greatly simplified. All the fd.c changes are in
routines that are only ever called by buffile.c anyway, and are not
too complicated (in general, temp fd.c files are only ever owned
transitively, through BufFiles). So, morally, the unification
mechanism is something that wholly belongs to buffile.c, since
unification is all about temp files, and buffile.h is the interface
through which temp files are owned and accessed in general, without
exception.

Unification remains specialized
-------------------------------

On the one hand, BufFileUnify() now describes the whole idea of
unification in detail, in its own general terms, including its
performance characteristics, but on the other hand it doesn't pretend
to be more general than it is (that's why we really have to talk about
performance characteristics). It doesn't go as far as admitting to
being the thing that logtape.c uses for parallel sort, but even that
doesn't seem totally unreasonable to me. I think that BufFileUnify()
might also end up being used by tuplestore.c, so it isn't entirely
non-general, but I now realize that it's unlikely to be used by
parallel hash join. So, while randomAccess reclamation of worker
blocks within the leader now occurs, I have not followed Robert's
suggestion in full. For example, I didn't do this: "ltsGetFreeBlock()
need to be re-engineered to handle concurrency with other backends".
The more I've thought about it, the more appropriate the kind of
specialization I've come up with seems. I've concluded:

- Sorting is important, and therefore worth adding non-general
infrastructure in support of. It's important enough to have its own
logtape.c module, so why not this? Much of buffile.c was explicitly
written with sorting and hashing in mind from the beginning. We use
BufFiles for other things, but those two things are by far the two
most important users of temp files, and the only really compelling
candidates for parallelization.

- There are limited opportunities to share BufFile infrastructure for
parallel sorting and parallel hashing. Hashing is inverse to sorting
conceptually, so it should not be surprising that this is the case. By
that I mean that hashing is characterized by logical division and
physical combination, whereas sorting is characterized by physical
division and logical combination. Parallel tuplesort naturally allows
each worker to do an enormous amount of work with whatever data it is
fed by the parallel heap scan that it joins, *long* before the data
needs to be combined with data from other workers in any way.

Consider this code from Thomas' parallel hash join patch:

> +bool
> +ExecHashCheckForEarlyExit(HashJoinTable hashtable)
> +{
> +   /*
> +    * The golden rule of leader deadlock avoidance: since leader processes
> +    * have two separate roles, namely reading from worker queues AND executing
> +    * the same plan as workers, we must never allow a leader to wait for
> +    * workers if there is any possibility those workers have emitted tuples.
> +    * Otherwise we could get into a situation where a worker fills up its
> +    * output tuple queue and begins waiting for the leader to read, while
> +    * the leader is busy waiting for the worker.
> +    *
> +    * Parallel hash joins with shared tables are inherently susceptible to
> +    * such deadlocks because there are points at which all participants must
> +    * wait (you can't start check for unmatched tuples in the hash table until
> +    * probing has completed in all workers, etc).

Parallel sort will never have to do anything like this. There is
minimal IPC before the leader's merge, and the dependencies between
phases are extremely simple (there is only one; workers need to finish
before leader can merge, and must stick around in a quiescent state
throughout). Data throughput is what tuplesort cares about; it doesn't
really care about latency. Whereas, I gather that there needs to be
continual gossip between hash join workers (those building a hash
table) about the number of batches. They don't have to be in perfect
lockstep, but they need to cooperate closely; the IPC is pretty eager,
and therefore latency sensitive. Thomas makes use of atomic ops in his
patch, which makes sense, but I'd never bother with anything like that
for parallel tuplesort; there'd be no measurable benefit there.

In general, it's not obvious to me that the SharedTemporaryFile() API
that Robert sketched recently (or any very general shared file
interface that does things like buffer writes in shared memory, uses a
shared read pointer, etc) is right for either parallel hash join or
parallel sort. I don't see that there is much to be said for a
reference count mechanism for parallel sort BufFiles, since the
dependencies are so simple and fixed, and for hash join, a much
tighter mechanism seems desirable. I can't think why Thomas would want
a shared read pointer, since the way he builds the shared hash table
leaves it immutable once probing is underway; ISTM that he'll want
that kind of mechanism to operate at a higher level, in a more
specialized way.

That said, I don't actually know what Thomas has in mind for
multi-batch parallel hash joins, since that's only a TODO item in the
most recent revision of his patch (maybe I missed something he wrote
on this topic, though). Thomas is working on a revision that resolves
that open item, at which point we'll know more. I understand that a
new revision of his patch that closes out the TODO item isn't too far
from being posted.

-- 
Peter Geoghegan


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Thomas Munro
Date:
On Wed, Jan 4, 2017 at 12:53 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Attached is V7 of the patch.

I am doing some testing.  First, some superficial things from first pass:

Still applies with some offsets and one easy-to-fix rejected hunk in
nbtree.c (removing some #include directives and a struct definition).

+/* Sort parallel code from state for sort__start probes */
+#define PARALLEL_SORT(state)   ((state)->shared == NULL ? 0 : \
+                                (state)->workerNum >= 0 : 1 : 2)

Typo: ':' instead of '?', --enable-dtrace build fails.
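
Presumably the intent was:

+#define PARALLEL_SORT(state)   ((state)->shared == NULL ? 0 : \
+                                (state)->workerNum >= 0 ? 1 : 2)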

+         the entire utlity command, regardless of the number of

Typo: s/utlity/utility/

+   /* Perform sorting of spool, and possibly a spool2 */
+   sortmem = Max(maintenance_work_mem / btshared->scantuplesortstates, 64);

Just an observation:  if you ask for a large number of workers, but
only one can be launched, it will be constrained to a small fraction
of maintenance_work_mem, but use only one worker.  That's probably OK,
and I don't see how to do anything about it unless you are prepared to
make workers wait for an initial message from the leader to inform
them how many were launched.
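
For example, with maintenance_work_mem = 512MB and scantuplesortstates
= 8, each participant gets Max(524288 / 8, 64) = 65536KB, i.e. 64MB,
even if only one worker actually launched.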

Should this 64KB minimum be mentioned in the documentation?

+   if (!btspool->isunique)
+   {
+       shm_toc_estimate_keys(&pcxt->estimator, 2);
+   }

Project style: people always tell me to drop the curlies in cases like
that.  There are a few more examples in the patch.

+ /* Wait for workers */
+ ConditionVariableSleep(&shared->workersFinishedCv,
+   WAIT_EVENT_PARALLEL_FINISH);

I don't think we should reuse WAIT_EVENT_PARALLEL_FINISH in
tuplesort_leader_wait and worker_wait.  That belongs to
WaitForParallelWorkersToFinish, so someone who sees that in
pg_stat_activity won't know which it is.

IIUC worker_wait() is only being used to keep the worker around so its
files aren't deleted.  Once buffile cleanup is changed to be
ref-counted (in an on_dsm_detach hook?) then workers might as well
exit sooner, freeing up a worker slot... do I have that right?

Incidentally, barrier.c could probably be used for this
synchronisation instead of these functions.  I think
_bt_begin_parallel would call BarrierInit(&shared->barrier,
scantuplesortstates) and then after LaunchParallelWorkers() it'd call
a new interface BarrierDetachN(&shared->barrier, scantuplesortstates -
pcxt->nworkers_launched) to forget about workers that failed to
launch.  Then you could use BarrierWait where the leader waits for the
workers to finish, and BarrierDetach where the workers are finished
and want to exit.
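
In sketch form (BarrierWait/BarrierDetach as in the barrier patch,
BarrierDetachN being the new interface proposed above, and the wait
event name just a placeholder):

/* leader, in _bt_begin_parallel() */
BarrierInit(&shared->barrier, scantuplesortstates);
LaunchParallelWorkers(pcxt);
/* forget the workers that never launched */
BarrierDetachN(&shared->barrier,
               scantuplesortstates - pcxt->nworkers_launched);

/* leader, where it currently calls tuplesort_leader_wait() */
BarrierWait(&shared->barrier, WAIT_EVENT_PARALLEL_CREATE_INDEX_SORT);

/* worker, once its run is sorted and flushed */
BarrierDetach(&shared->barrier);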

+ /* Prepare state to create unified tapeset */
+ leaderTapes = palloc(sizeof(TapeShare) * state->maxTapes);

Missing cast (TapeShare *) here?  Project style, judging by code I've
seen, and it avoids gratuitous C++ incompatibility.

+_bt_parallel_shared_estimate(Snapshot snapshot)
...
+tuplesort_estimate_shared(int nWorkers)

Inconsistent naming?

More soon.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Mon, Jan 30, 2017 at 8:46 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Wed, Jan 4, 2017 at 12:53 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> Attached is V7 of the patch.
>
> I am doing some testing.  First, some superficial things from first pass:
>
> [Various minor cosmetic issues]

Oops.

> Just an observation:  if you ask for a large number of workers, but
> only one can be launched, it will be constrained to a small fraction
> of maintenance_work_mem, but use only one worker.  That's probably OK,
> and I don't see how to do anything about it unless you are prepared to
> make workers wait for an initial message from the leader to inform
> them how many were launched.

Actually, the leader-owned worker Tuplesort state will have the
appropriate amount, so you'd still need to have 2 participants (1
worker + leader-as-worker). And, sorting is much less sensitive to
having a bit less memory than hashing (at least when there aren't
dozens of runs to merge in the end, or multiple passes). So, I agree
that this isn't worth worrying about for a DDL statement.

> Should this 64KB minimum be mentioned in the documentation?

You mean user-visible documentation, and not just tuplesort.h? I don't
think that that's necessary. That's a ludicrously low amount of memory
for a worker to be limited to anyway. It will never come up with
remotely sensible use of the feature.

> +   if (!btspool->isunique)
> +   {
> +       shm_toc_estimate_keys(&pcxt->estimator, 2);
> +   }
>
> Project style: people always tell me to drop the curlies in cases like
> that.  There are a few more examples in the patch.

I only do this when there is an "else" that must have curly braces,
too. There are plenty of examples of this from existing code, so I
think it's fine.

> + /* Wait for workers */
> + ConditionVariableSleep(&shared->workersFinishedCv,
> +   WAIT_EVENT_PARALLEL_FINISH);
>
> I don't think we should reuse WAIT_EVENT_PARALLEL_FINISH in
> tuplesort_leader_wait and worker_wait.  That belongs to
> WaitForParallelWorkersToFinish, so someone who see that in
> pg_stat_activity won't know which it is.

Noted.

> IIUC worker_wait() is only being used to keep the worker around so its
> files aren't deleted.  Once buffile cleanup is changed to be
> ref-counted (in an on_dsm_detach hook?) then workers might as well
> exit sooner, freeing up a worker slot... do I have that right?

Yes. Or at least I think it's very likely that that will end up happening.

> Incidentally, barrier.c could probably be used for this
> synchronisation instead of these functions.  I think
> _bt_begin_parallel would call BarrierInit(&shared->barrier,
> scantuplesortstates) and then after LaunchParallelWorkers() it'd call
> a new interface BarrierDetachN(&shared->barrier, scantuplesortstates -
> pcxt->nworkers_launched) to forget about workers that failed to
> launch.  Then you could use BarrierWait where the leader waits for the
> workers to finish, and BarrierDetach where the workers are finished
> and want to exit.

I thought about doing that, actually, but I don't like creating
dependencies on some other uncommited patch, which is a moving target
(barrier stuff isn't committed yet). It makes life difficult for
reviewers. I put off adopting condition variables until they were
committed for the same reason -- it was easy to do without them for
a time. I'll probably get around to it before too long, but feel no
urgency about it. Barriers will only allow me to make a modest net
removal of code, AFAIK.

Thanks
-- 
Peter Geoghegan



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Robert Haas
Date:
On Tue, Jan 31, 2017 at 12:15 AM, Peter Geoghegan <pg@bowt.ie> wrote:
>> Should this 64KB minimum be mentioned in the documentation?
>
> You mean user-visible documentation, and not just tuplesort.h? I don't
> think that that's necessary. That's a ludicrously low amount of memory
> for a worker to be limited to anyway. It will never come up with
> remotely sensible use of the feature.

I agree.

>> +   if (!btspool->isunique)
>> +   {
>> +       shm_toc_estimate_keys(&pcxt->estimator, 2);
>> +   }
>>
>> Project style: people always tell me to drop the curlies in cases like
>> that.  There are a few more examples in the patch.
>
> I only do this when there is an "else" that must have curly braces,
> too. There are plenty of examples of this from existing code, so I
> think it's fine.

But I disagree on this one.  I think

if (blah)
    stuff();
else
{
    thing();
    gargle();
}

...is much better than

if (blah)
{
    stuff();
}
else
{
    thing();
    gargle();
}

But if there were a comment on a separate line before the call to
stuff(), then I would do it the second way.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Michael Paquier
Date:
On Tue, Jan 31, 2017 at 2:15 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Mon, Jan 30, 2017 at 8:46 PM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> On Wed, Jan 4, 2017 at 12:53 PM, Peter Geoghegan <pg@heroku.com> wrote:
>>> Attached is V7 of the patch.
>>
>> I am doing some testing.  First, some superficial things from first pass:
>>
>> [Various minor cosmetic issues]
>
> Oops.

As this review is very recent, I have moved the patch to CF 2017-03.
-- 
Michael



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Thomas Munro
Date:
On Wed, Feb 1, 2017 at 5:37 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Tue, Jan 31, 2017 at 2:15 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>> On Mon, Jan 30, 2017 at 8:46 PM, Thomas Munro
>> <thomas.munro@enterprisedb.com> wrote:
>>> On Wed, Jan 4, 2017 at 12:53 PM, Peter Geoghegan <pg@heroku.com> wrote:
>>>> Attached is V7 of the patch.
>>>
>>> I am doing some testing.  First, some superficial things from first pass:
>>>
>>> [Various minor cosmetic issues]
>>
>> Oops.
>
> As this review is very recent, I have moved the patch to CF 2017-03.
 ParallelContext *
-CreateParallelContext(parallel_worker_main_type entrypoint, int nworkers)
+CreateParallelContext(parallel_worker_main_type entrypoint, int nworkers,
+                      bool serializable_okay)
 {
     MemoryContext oldcontext;
     ParallelContext *pcxt;

@@ -143,7 +144,7 @@ CreateParallelContext(parallel_worker_main_type entrypoint, int nworkers)
      * workers, at least not until somebody enhances that mechanism to be
      * parallel-aware.
      */

-    if (IsolationIsSerializable())
+    if (IsolationIsSerializable() && !serializable_okay)
         nworkers = 0;

That's a bit weird but I can't think of a problem with it.  Workers
run with MySerializableXact == InvalidSerializableXact, even though
they may have the snapshot of a SERIALIZABLE leader.  Hopefully soon
the restriction on SERIALIZABLE in parallel queries can be lifted
anyway, and then this could be removed.

Here are some thoughts on the overall approach.  Disclaimer:  I
haven't researched the state of the art in parallel sort or btree
builds.  But I gather from general reading that there are a couple of
well known approaches, and I'm sure you'll correct me if I'm off base
here.

1.  All participants: parallel sequential scan, repartition on the fly
so each worker has tuples in a non-overlapping range, sort, build
disjoint btrees; barrier; leader: merge disjoint btrees into one.

2.  All participants: parallel sequential scan, sort, spool to disk;
barrier; leader: merge spooled tuples and build btree.

This patch is doing the 2nd thing.  My understanding is that some
systems might choose to do that if they don't have or don't like the
table's statistics, since repartitioning for balanced load requires
carefully chosen ranges and is highly sensitive to distribution
problems.

It's pretty clear that approach 1 is a difficult project.  From my
research into dynamic repartitioning in the context of hash joins, I
can see that that infrastructure is a significant project in its own
right: subproblems include super efficient tuple exchange, buffering,
statistics/planning and dealing with/adapting to bad outcomes.  I also
suspect that repartitioning operators might need to be specialised for
different purposes like sorting vs hash joins, which may have
differing goals.  I think it's probably easy to build a slow dynamic
repartitioning mechanism that frequently results in terrible worst
case scenarios where you paid a fortune in IPC overheads and still
finished up with one worker pulling most of the whole load.  Without
range partitioning, I don't believe you can merge the resulting
non-disjoint btrees efficiently so you'd probably finish up writing a
complete new btree to mash them together.  As for merging disjoint
btrees, I assume there are ways to do a structure-preserving merge
that just rebuilds some internal pages and incorporates the existing
leaf pages directly, a bit like tree manipulation in functional
programming languages; that'll take some doing.

So I'm in favour of this patch, which is relatively simple and gives us
faster index builds soon.  Eventually we might also be able to have
approach 1.  From what I gather, it's entirely possible that we might
still need 2 to fall back on in some cases.

Will you move the BufFile changes to a separate patch in the next revision?

Still testing and reviewing, more soon.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Tue, Jan 31, 2017 at 11:23 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> 2.  All participants: parallel sequential scan, sort, spool to disk;
> barrier; leader: merge spooled tuples and build btree.
>
> This patch is doing the 2nd thing.  My understanding is that some
> systems might choose to do that if they don't have or don't like the
> table's statistics, since repartitioning for balanced load requires
> carefully chosen ranges and is highly sensitive to distribution
> problems.

The second thing here seems to offer comparable scalability to other
systems' implementations of the first thing. They seem to have reused
"partitioning to sort in parallel" for B-Tree builds, at least in some
cases, despite this. WAL logging is the biggest serial bottleneck here
for other systems, I've heard -- that's still going to be pretty much
serial.

I think that the fact that some systems do partitioning for parallel
B-Tree builds might have as much to do with their ability to create
B-Tree indexes in place as anything else. Apparently, some systems
don't use temp files, instead writing out what is for all intents and
purposes part of a finished B-Tree as runs (no use of
temp_tablespaces). That may be a big part of what makes it worthwhile
to try to use partitioning. I understand that only the highest client
counts will see much direct performance benefit relative to the first
approach.

> It's pretty clear that approach 1 is a difficult project.  From my
> research into dynamic repartitioning in the context of hash joins, I
> can see that that infrastructure is a significant project in its own
> right: subproblems include super efficient tuple exchange, buffering,
> statistics/planning and dealing with/adapting to bad outcomes.  I also
> suspect that repartitioning operators might need to be specialised for
> different purposes like sorting vs hash joins, which may have
> differing goals.  I think it's probably easy to build a slow dynamic
> repartitioning mechanism that frequently results in terrible worst
> case scenarios where you paid a fortune in IPC overheads and still
> finished up with one worker pulling most of the whole load.  Without
> range partitioning, I don't believe you can merge the resulting
> non-disjoint btrees efficiently so you'd probably finish up writing a
> complete new btree to mash them together.  As for merging disjoint
> btrees, I assume there are ways to do a structure-preserving merge
> that just rebuilds some internal pages and incorporates the existing
> leaf pages directly, a bit like tree manipulation in functional
> programming languages; that'll take some doing.

I agree with all that. "Stitching together" disjoint B-Trees does seem
to have some particular risks, which users of other systems are
cautioned against in their documentation. You can end up with an
unbalanced B-Tree.

> So I'm in favour of this patch, which is relatively simple and give us
> faster index builds soon.  Eventually we might also be able to have
> approach 1.  From what I gather, it's entirely possible that we might
> still need 2 to fall back on in some cases.

Right. And it can form the basis of an implementation of 1, which in
any case seems to be much more compelling for parallel query, when a
great deal more can be pushed down, and we are not particularly likely
to be I/O bound (usually not much writing to the heap, or WAL
logging).

> Will you move the BufFile changes to a separate patch in the next revision?

That is the plan. I need to get set up with a new machine here, having
given back my work laptop to Heroku, but it shouldn't take too long.

Thanks for the review.
-- 
Peter Geoghegan



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Thomas Munro
Date:
On Wed, Feb 1, 2017 at 8:46 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Tue, Jan 31, 2017 at 11:23 PM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> So I'm in favour of this patch, which is relatively simple and gives us
>> faster index builds soon.  Eventually we might also be able to have
>> approach 1.  From what I gather, it's entirely possible that we might
>> still need 2 to fall back on in some cases.
>
> Right. And it can form the basis of an implementation of 1, which in
> any case seems to be much more compelling for parallel query, when a
> great deal more can be pushed down, and we are not particularly likely
> to be I/O bound (usually not much writing to the heap, or WAL
> logging).

I ran some tests today.  First I created test tables representing the
permutations of these choices:

Table structure:

  int = Integer key only
  intwide = Integer key + wide row
  text = Text key only (using dictionary words)
  textwide = Text key + wide row

Uniqueness:

  u = each value unique
  d = 10 duplicates of each value

Heap physical order:

  rand = Random
  asc = Ascending order (already sorted)
  desc = Descending order (sorted backwards)

I used 10 million rows for this test run, so that gave me 24 tables of
the following sizes as reported in "\d+":

  int tables = 346MB each
  intwide tables = 1817MB each
  text tables = 441MB each
  textwide tables = 1953MB each

It'd be interesting to test larger tables of course but I had a lot of
permutations to get through.

For each of those tables I ran tests corresponding to the permutations
of these three variables:

Index type:

  uniq = CREATE UNIQUE INDEX ("u" tables only, ie no duplicates)
  nonu = CREATE INDEX ("u" and "d" tables)

Maintenance memory: 1M, 64MB, 256MB, 512MB

Workers: from 0 up to 8

Environment:  EDB test machine "cthulhu", Intel(R) Xeon(R) CPU
E7-8830 @ 2.13GHz, 8 socket, 8 cores (16 threads) per socket, CentOS
7.2, Linux kernel 3.10.0-229.7.2.el7.x86_64, 512GB RAM, pgdata on SSD.
Database initialised with en_US.utf-8 collation, all defaults except
max_wal_size increased to 4GB (otherwise warnings about too frequent
checkpoints) and max_parallel_workers_maintenance = 8.  Testing done
with warm OS cache.

I applied your v2 patch on top of
7ac4a389a7dbddaa8b19deb228f0a988e79c5795^ to avoid a conflict.  It
still had a couple of harmless conflicts that I was able to deal with
(not code, just some header stuff moving around).

See full results from all permutations attached, but I wanted to
highlight the measurements from 'textwide', 'u', 'nonu' which show
interesting 'asc' numbers (data already sorted).  The 'mem' column is
maintenance_work_mem in megabytes.  The 'w = 0' column shows the time
in seconds for parallel_workers = 0.  The other 'w = N' columns show
times with higher parallel_workers settings, represented as speed-up
relative to the 'w = 0' time.

1. 'asc' = pre-sorted data (w = 0 shows time in seconds, other columns
show speed-up relative to that time):

 mem | w = 0  | w = 1 | w = 2 | w = 3 | w = 4 | w = 5 | w = 6 | w = 7 | w = 8
-----+--------+-------+-------+-------+-------+-------+-------+-------+-------
   1 | 119.97 | 4.61x | 4.83x | 5.32x | 5.61x | 5.88x | 6.10x | 6.18x | 6.09x
  64 |  19.42 | 1.18x | 1.10x | 1.23x | 1.23x | 1.16x | 1.19x | 1.20x | 1.21x
 256 |  18.35 | 1.02x | 0.92x | 0.98x | 1.02x | 1.06x | 1.07x | 1.08x | 1.10x
 512 |  17.75 | 1.01x | 0.89x | 0.95x | 0.99x | 1.02x | 1.05x | 1.06x | 1.07x

2. 'rand' = randomised data:

 mem | w = 0  | w = 1 | w = 2 | w = 3 | w = 4 | w = 5 | w = 6 | w = 7 | w = 8
-----+--------+-------+-------+-------+-------+-------+-------+-------+-------
   1 | 130.25 | 1.82x | 2.19x | 2.52x | 2.58x | 2.72x | 2.72x | 2.83x | 2.89x
  64 | 117.36 | 1.80x | 2.20x | 2.43x | 2.47x | 2.55x | 2.51x | 2.59x | 2.69x
 256 | 124.68 | 1.87x | 2.20x | 2.49x | 2.52x | 2.64x | 2.70x | 2.72x | 2.75x
 512 | 115.77 | 1.51x | 1.72x | 2.14x | 2.08x | 2.19x | 2.31x | 2.44x | 2.48x

3. 'desc' = reverse-sorted data:

 mem | w = 0  | w = 1 | w = 2 | w = 3 | w = 4 | w = 5 | w = 6 | w = 7 | w = 8
-----+--------+-------+-------+-------+-------+-------+-------+-------+-------
   1 | 115.19 | 1.88x | 2.39x | 2.78x | 3.50x | 3.62x | 4.20x | 4.19x | 4.39x
  64 | 112.17 | 1.85x | 2.25x | 2.99x | 3.63x | 3.65x | 4.01x | 4.31x | 4.62x
 256 | 119.55 | 1.76x | 2.21x | 2.85x | 3.43x | 3.37x | 3.77x | 4.24x | 4.28x
 512 | 119.50 | 1.85x | 2.19x | 2.87x | 3.26x | 3.28x | 3.74x | 4.24x | 3.93x

The 'asc' effects are much less pronounced when the key is an int.
Here is the equivalent data for 'intwide', 'u', 'nonu':

1.  'asc'

 mem | w = 0 | w = 1 | w = 2 | w = 3 | w = 4 | w = 5 | w = 6 | w = 7 | w = 8
-----+-------+-------+-------+-------+-------+-------+-------+-------+-------
   1 | 12.19 | 1.55x | 1.93x | 2.21x | 2.44x | 2.64x | 2.76x | 2.91x | 2.83x
  64 |  7.35 | 1.29x | 1.53x | 1.69x | 1.86x | 1.98x | 2.04x | 2.07x | 2.09x
 256 |  7.34 | 1.26x | 1.47x | 1.64x | 1.79x | 1.92x | 1.96x | 1.98x | 2.02x
 512 |  7.24 | 1.24x | 1.46x | 1.65x | 1.80x | 1.91x | 1.97x | 2.00x | 1.92x

2. 'rand'

 mem | w = 0 | w = 1 | w = 2 | w = 3 | w = 4 | w = 5 | w = 6 | w = 7 | w = 8
-----+-------+-------+-------+-------+-------+-------+-------+-------+-------
   1 | 15.16 | 1.56x | 2.01x | 2.32x | 2.57x | 2.73x | 2.87x | 2.95x | 2.91x
  64 | 12.97 | 1.55x | 1.97x | 2.25x | 2.44x | 2.58x | 2.70x | 2.74x | 2.71x
 256 | 13.14 | 1.47x | 1.86x | 2.12x | 2.31x | 2.50x | 2.62x | 2.58x | 2.69x
 512 | 13.61 | 1.48x | 1.91x | 2.22x | 2.37x | 2.55x | 2.65x | 2.73x | 2.73x

3. 'desc'

 mem | w = 0 | w = 1 | w = 2 | w = 3 | w = 4 | w = 5 | w = 6 | w = 7 | w = 8
-----+-------+-------+-------+-------+-------+-------+-------+-------+-------
   1 | 13.45 | 1.51x | 1.94x | 2.31x | 2.56x | 2.75x | 2.95x | 3.05x | 3.00x
  64 | 10.27 | 1.42x | 1.82x | 2.05x | 2.30x | 2.46x | 2.59x | 2.64x | 2.65x
 256 | 10.52 | 1.39x | 1.70x | 2.02x | 2.24x | 2.34x | 2.39x | 2.48x | 2.56x
 512 | 10.62 | 1.43x | 1.82x | 2.06x | 2.32x | 2.51x | 2.61x | 2.68x | 2.69x

Full result summary and scripts used for testing attached.

-- 
Thomas Munro
http://www.enterprisedb.com


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Fri, Feb 3, 2017 at 5:04 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> I applied your v2 patch on top of
> 7ac4a389a7dbddaa8b19deb228f0a988e79c5795^ to avoid a conflict.  It
> still had a couple of harmless conflicts that I was able to deal with
> (not code, just some header stuff moving around).

You must mean my V7 patch. FWIW, I've resolved the conflicts with
7ac4a389a7dbddaa8b19deb228f0a988e79c5795 in my own private branch, and
have worked through some of the open items that you raised.

> See full results from all permutations attached, but I wanted to
> highlight the measurements from 'textwide', 'u', 'nonu' which show
> interesting 'asc' numbers (data already sorted).  The 'mem' column is
> maintenance_work_mem in megabytes.  The 'w = 0' column shows the time
> in seconds for parallel_workers = 0.  The other 'w = N' columns show
> times with higher parallel_workers settings, represented as speed-up
> relative to the 'w = 0' time.

The thing to keep in mind about testing presorted cases in tuplesort
in general is that we have this weird precheck for presorted input in
our qsort. This is something added by us to the original Bentley &
McIlroy algorithm in 2006. I am very skeptical of this addition, in
general. It tends to have the effect of highly distorting how
effective most optimizations are for presorted cases, which comes up
again and again. It only works when the input is *perfectly*
presorted, and a single out-of-order tuple at the end of the input
throws away all of the work done up to that point (not so bad if you
think your main cost is comparisons rather than memory accesses, but
that isn't the case).

Your baseline case can either be made unrealistically fast due to the
fact that you get a perfectly sympathetic case for this optimization,
or unrealistically slow (very CPU bound) due to the fact that you have
that one last tuple out of place. I once said that this last tuple can
act like a discarded banana skin.

There is nothing wrong with the idea of exploiting presortedness, and
to some extent the original algorithm does that (by using insertion
sort), but an optimization along the lines of Timsort's "galloping
mode" (which is what this modification of ours attempts) requires
non-trivial bookkeeping to do right.
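
To illustrate the shape of the check (a simplified sketch, not the
actual code in our qsort):

#include <stdbool.h>
#include <stddef.h>

/*
 * One linear scan over the input; a single out-of-order pair discards
 * everything learned so far, and the full sort starts from scratch.
 */
static bool
input_is_presorted(char *a, size_t n, size_t es,
                   int (*cmp) (const void *, const void *))
{
    char       *pm;

    for (pm = a + es; pm < a + n * es; pm += es)
    {
        if (cmp(pm - es, pm) > 0)
            return false;       /* one misplaced tuple wastes the scan */
    }
    return true;                /* perfectly sorted: nothing more to do */
}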

-- 
Peter Geoghegan



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Fri, Feb 3, 2017 at 5:04 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> 1. 'asc' = pre-sorted data (w = 0 shows time in seconds, other columns
> show speed-up relative to that time):
>
>  mem | w = 0  | w = 1 | w = 2 | w = 3 | w = 4 | w = 5 | w = 6 | w = 7 | w = 8
> -----+--------+-------+-------+-------+-------+-------+-------+-------+-------
>    1 | 119.97 | 4.61x | 4.83x | 5.32x | 5.61x | 5.88x | 6.10x | 6.18x | 6.09x
>   64 |  19.42 | 1.18x | 1.10x | 1.23x | 1.23x | 1.16x | 1.19x | 1.20x | 1.21x
>  256 |  18.35 | 1.02x | 0.92x | 0.98x | 1.02x | 1.06x | 1.07x | 1.08x | 1.10x
>  512 |  17.75 | 1.01x | 0.89x | 0.95x | 0.99x | 1.02x | 1.05x | 1.06x | 1.07x

I think that this presorted case doesn't improve much because the
sorting itself is so cheap, as explained in my last mail. However, the
improvement as workers are added is still smaller than expected. I
think that this indicates that there isn't enough I/O capacity
available here to truly show the full potential of the patch -- I've
certainly seen better scalability for cases like this when there is a
lot of I/O bandwidth available, and I/O parallelism is there to be
taken advantage of. Say, when using a system with a large RAID array
(I used a RAID0 array with 12 HDDs for my own tests). Another issue is
that you probably don't have enough data here to really show off the
patch. I don't want to dismiss the benchmark, which is still quite
informative, but it's worth pointing out that the feature is going to
be most compelling for very large indexes, which will take at least
several minutes to build under any circumstances. (On the other hand,
having a reproducible case is also important, and that is something
this benchmark has going for it.)

I suspect that this system isn't particularly well balanced for the
task of benchmarking the patch. You would probably see notably better
scalability than any you've shown in any test if you could add
additional sequential I/O bandwidth, which is probably an economical,
practical choice for many users. I suspect that you aren't actually
saturating available CPUs to the greatest extent that the
implementation makes possible.

Another thing I want to point out is that with 1MB of
maintenance_work_mem, the patch appears to do very well, but that
isn't terribly meaningful. I would suggest that we avoid testing this
patch with such a low amount of memory -- it doesn't seem important.
This is skewed by the fact that you're using replacement selection in
the serial case only. I think what this actually demonstrates is that
replacement selection is very slow, even with its putative best case.
I believe that commit 2459833 was the final nail in the coffin of
replacement selection. I certainly don't want to relitigate the
discussion on replacement_sort_tuples, and am not going to push too
hard, but ISTM that we should fully remove replacement selection from
tuplesort.c and be done with it.

-- 
Peter Geoghegan



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Thomas Munro
Date:
On Sat, Feb 4, 2017 at 11:58 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Fri, Feb 3, 2017 at 5:04 AM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> 1. 'asc' = pre-sorted data (w = 0 shows time in seconds, other columns
>> show speed-up relative to that time):
>>
>>  mem | w = 0  | w = 1 | w = 2 | w = 3 | w = 4 | w = 5 | w = 6 | w = 7 | w = 8
>> -----+--------+-------+-------+-------+-------+-------+-------+-------+-------
>>    1 | 119.97 | 4.61x | 4.83x | 5.32x | 5.61x | 5.88x | 6.10x | 6.18x | 6.09x
>>   64 |  19.42 | 1.18x | 1.10x | 1.23x | 1.23x | 1.16x | 1.19x | 1.20x | 1.21x
>>  256 |  18.35 | 1.02x | 0.92x | 0.98x | 1.02x | 1.06x | 1.07x | 1.08x | 1.10x
>>  512 |  17.75 | 1.01x | 0.89x | 0.95x | 0.99x | 1.02x | 1.05x | 1.06x | 1.07x
>
> I think that this presorted case doesn't improve much because the
> sorting itself is so cheap, as explained in my last mail. However, the
> improvement as workers are added is still smaller than expected. I
> think that this indicates that there isn't enough I/O capacity
> available here to truly show the full potential of the patch -- I've
> certainly seen better scalability for cases like this when there is a
> lot of I/O bandwidth available, and I/O parallelism is there to be
> taken advantage of. Say, when using a system with a large RAID array
> (I used a RAID0 array with 12 HDDs for my own tests). Another issue is
> that you probably don't have enough data here to really show off the
> patch. I don't want to dismiss the benchmark, which is still quite
> informative, but it's worth pointing out that the feature is going to
> be most compelling for very large indexes, which will take at least
> several minutes to build under any circumstances. (On the other hand,
> having a reproducible case is also important, and that is something
> this benchmark has going for it.)

Right.  My main reason for starting smallish was to allow me to search
a space with several variables without waiting eons.  Next I would
like to run a small subset of those tests with, say, 10, 20 or even
100 times more data loaded, so the tables would be ~20GB, ~40GB or
~200GB.

About read bandwidth:  It shouldn't have been touching the disk at all
for reads: I did a dummy run of the index build before the measured
runs, so that a 2GB table being sorted in ~2 minutes would certainly
have come entirely from the OS page cache since the machine has oodles
of RAM.

About write bandwidth:  The WAL, the index and the temp files all went
to an SSD array, though I don't have the characteristics of that to
hand.  I should also be able to test on multi-spindle HDD array.  I
doubt either can touch your 12 way RAID0 array, but will look into
that.

> I suspect that this system isn't particularly well balanced for the
> task of benchmarking the patch. You would probably see notably better
> scalability than any you've shown in any test if you could add
> additional sequential I/O bandwidth, which is probably an economical,
> practical choice for many users. I suspect that you aren't actually
> saturating available CPUs to the greatest extent that the
> implementation makes possible.

I will look into what IO options I can access before running larger
tests.  Also I will look into running the test with both cold and warm
caches (ie "echo 1 > /proc/sys/vm/drop_caches") so that read bandwidth
enters the picture.

> Another thing I want to point out is that with 1MB of
> maintenance_work_mem, the patch appears to do very well, but that
> isn't terribly meaningful. I would suggest that we avoid testing this
> patch with such a low amount of memory -- it doesn't seem important.
> This is skewed by the fact that you're using replacement selection in
> the serial case only. I think what this actually demonstrates is that
> replacement selection is very slow, even with its putative best case.
> I believe that commit 2459833 was the final nail in the coffin of
> replacement selection. I certainly don't want to relitigate the
> discussion on replacement_sort_tuples, and am not going to push too
> hard, but ISTM that we should fully remove replacement selection from
> tuplesort.c and be done with it.

Interesting.  I haven't grokked this but will go and read about it.

Based on your earlier comments about banana skin effects, I'm
wondering if it would be interesting to add a couple more heap
distributions to the test set that are almost completely sorted except
for a few entries out of order.
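
Something like this would do it, I think (a sketch with made-up names;
note that the displaced rows sacrifice strict uniqueness):

CREATE TABLE int_u_almostasc AS
    SELECT CASE WHEN random() < 0.0001
                THEN (random() * 10000000)::int  -- ~1 in 10,000 displaced
                ELSE i
           END AS key
    FROM generate_series(1, 10000000) i;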

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Fri, Feb 3, 2017 at 4:15 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
>> I suspect that this system isn't particularly well balanced for the
>> task of benchmarking the patch. You would probably see notably better
>> scalability than any you've shown in any test if you could add
>> additional sequential I/O bandwidth, which is probably an economical,
>> practical choice for many users. I suspect that you aren't actually
>> saturating available CPUs to the greatest extent that the
>> implementation makes possible.
>
> I will look into what IO options I can access before running larger
> tests.  Also I will look into running the test with both cold and warm
> caches (ie "echo 1 > /proc/sys/vm/drop_caches") so that read bandwidth
> enters the picture.

It might just have been that the table was too small to be an
effective target for parallel sequential scan with so many workers,
and so a presorted best case CREATE INDEX, which isn't that different,
also fails to see much benefit (compared to what you'd see with a
similar case involving a larger table). In other words, I might have
jumped the gun in emphasizing issues with hardware and I/O bandwidth
over issues around data volume (that I/O parallelism is inherently not
very helpful with these relatively small tables).

As I've pointed out a couple of times before, bigger sorts will be
more CPU bound because sorting itself has costs that grow
linearithmically, whereas writing out runs has costs that grow
linearly. The relative cost of the I/O can be expected to go down as
input goes up for this reason. At the same time, a larger input might
make better use of I/O parallelism, which reduces the cost paid in
latency to write out runs in absolute terms.
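
As a back-of-envelope illustration of that point (just arithmetic, not
code from the patch):

#include <math.h>
#include <stdio.h>

int
main(void)
{
    double      n;

    /* comparison cost ~ n * log2(n), run-writing cost ~ n */
    for (n = 1e7; n <= 1e9; n *= 10)
        printf("n = %.0e: cpu/io ratio ~ log2(n) = %.1f\n", n, log2(n));
    return 0;
}

Each hundredfold increase in input adds about 6.6 (log2(100)) to that
ratio, which is why bigger sorts end up more CPU bound.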

-- 
Peter Geoghegan



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Mon, Jan 30, 2017 at 9:15 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>> IIUC worker_wait() is only being used to keep the worker around so its
>> files aren't deleted.  Once buffile cleanup is changed to be
>> ref-counted (in an on_dsm_detach hook?) then workers might as well
>> exit sooner, freeing up a worker slot... do I have that right?
>
> Yes. Or at least I think it's very likely that that will end up happening.

I've looked into this, and have a version of the patch where clean-up
occurs when the last backend with a reference to the BufFile goes
away. It seems robust; all of my private tests pass, including things
that parallel CREATE INDEX won't use but that are added as
infrastructure (e.g., randomAccess recycling of blocks by the leader
from workers).

As Thomas anticipated, worker_wait() now only makes workers wait until
the leader comes along to take a reference to their files, at which
point the worker processes can go away. In effect, the worker
processes go away as soon as possible, just as the leader begins its
final on-the-fly merge. At that point, they could be reused by some
other process, of course.

However, there are some specific implementation issues with this that
I didn't quite anticipate. I would like to get feedback on these
issues now, from both Thomas and Robert. The issues relate to how much
the patch can or should "buy into resource management". You might
guess that this new resource management code is something that should
live in fd.c, alongside the guts of temp file resource management,
within the function FileClose(). That way, it would be called by every
possible path that might delete a temp file, including
ResourceOwnerReleaseInternal(). That's not what I've done, though.
Instead, refcount management is limited to a few higher level routines
in buffile.c. Initially, resource management in FileClose() is made to
assume that it must delete the file. Then, if and when directed to by
BufFileClose()/refcount, a backend may determine that it is not its
job to do the deletion -- it will not be the one that must "turn out
the lights", and so indicates to FileClose() that it should not delete
the file after all (it should just release vFDs, close(), and so on).
Otherwise, when refcount reaches zero, temp files are deleted by
FileClose() in more or less the conventional manner.
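
In outline, the close path looks something like this (a simplified
sketch with invented names -- SharedBufFileState and BufFileSetPreserve
don't exist as such, and the real patch has more state to manage):

static void
shared_buf_file_close(SharedBufFileState *shared, BufFile *file)
{
    bool        turn_out_the_lights;

    SpinLockAcquire(&shared->mutex);
    Assert(shared->refcount > 0);
    turn_out_the_lights = (--shared->refcount == 0);
    SpinLockRelease(&shared->mutex);

    if (!turn_out_the_lights)
        BufFileSetPreserve(file);   /* invented: tell fd.c not to unlink */

    BufFileClose(file);             /* releases vFDs; unlinks only if last */
}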

The fact that there could, in general, be any error that causes us to
attempt a double-deletion (deletion of a temp file from more than one
backend) for a time is less of a problem than you might think. This is
because there is a risk of this only for as long as two backends hold
open the file at the same time. In the case of parallel CREATE INDEX,
this is now the shortest possible period of time, since workers close
their files using BufFileClose() immediately after the leader wakes
them up from a quiescent state. And, if that were to actually happen,
say due to some random OOM error during that small window, the
consequence is no worse than an annoying log message: "could not
unlink file..." (this would come from the second backend that
attempted an unlink()). You would not see this when a worker raised an
error due to a duplicate violation, or any other routine problem, so
it should really be almost impossible.

That having been said, this probably *is* a problematic restriction in
cases where a temp file's ownership is not immediately handed over
without concurrent sharing. What happens to be a small window for the
parallel CREATE INDEX patch probably wouldn't be a small window for
parallel hash join.   :-(

It's not hard to see why I would like to do things this way. Just look
at ResourceOwnerReleaseInternal(). Any release of a file happens
during RESOURCE_RELEASE_AFTER_LOCKS, whereas the release of dynamic
shared memory segments happens earlier, during
RESOURCE_RELEASE_BEFORE_LOCKS. ISTM that the only sensible way to
implement a refcount is using dynamic shared memory, and that seems
hard. There are additional reasons why I suggest we go this way, such
as the fact that all the relevant state belongs to BufFile, which is
implemented a layer above all of the guts of resource management of
temp files within fd.c. I'd have to replicate almost all state in fd.c
to make it all work, which seems like a big modularity violation.

Does anyone have any suggestions on how to tackle this?

-- 
Peter Geoghegan



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Thomas Munro
Date:
On Tue, Feb 7, 2017 at 5:43 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> However, there are some specific implementation issues with this that
> I didn't quite anticipate. I would like to get feedback on these
> issues now, from both Thomas and Robert. The issues relate to how much
> the patch can or should "buy into resource management". You might
> guess that this new resource management code is something that should
> live in fd.c, alongside the guts of temp file resource management,
> within the function FileClose(). That way, it would be called by every
> possible path that might delete a temp file, including
> ResourceOwnerReleaseInternal(). That's not what I've done, though.
> Instead, refcount management is limited to a few higher level routines
> in buffile.c. Initially, resource management in FileClose() is made to
> assume that it must delete the file. Then, if and when directed to by
> BufFileClose()/refcount, a backend may determine that it is not its
> job to do the deletion -- it will not be the one that must "turn out
> the lights", and so indicates to FileClose() that it should not delete
> the file after all (it should just release vFDs, close(), and so on).
> Otherwise, when refcount reaches zero, temp files are deleted by
> FileClose() in more or less the conventional manner.
>
> The fact that there could, in general, be any error that causes us to
> attempt a double-deletion (deletion of a temp file from more than one
> backend) for a time is less of a problem than you might think. This is
> because there is a risk of this only for as long as two backends hold
> open the file at the same time. In the case of parallel CREATE INDEX,
> this is now the shortest possible period of time, since workers close
> their files using BufFileClose() immediately after the leader wakes
> them up from a quiescent state. And, if that were to actually happen,
> say due to some random OOM error during that small window, the
> consequence is no worse than an annoying log message: "could not
> unlink file..." (this would come from the second backend that
> attempted an unlink()). You would not see this when a worker raised an
> error due to a duplicate violation, or any other routine problem, so
> it should really be almost impossible.
>
> That having been said, this probably *is* a problematic restriction in
> cases where a temp file's ownership is not immediately handed over
> without concurrent sharing. What happens to be a small window for the
> parallel CREATE INDEX patch probably wouldn't be a small window for
> parallel hash join.   :-(
>
> It's not hard to see why I would like to do things this way. Just look
> at ResourceOwnerReleaseInternal(). Any release of a file happens
> during RESOURCE_RELEASE_AFTER_LOCKS, whereas the release of dynamic
> shared memory segments happens earlier, during
> RESOURCE_RELEASE_BEFORE_LOCKS. ISTM that the only sensible way to
> implement a refcount is using dynamic shared memory, and that seems
> hard. There are additional reasons why I suggest we go this way, such
> as the fact that all the relevant state belongs to BufFile, which is
> implemented a layer above all of the guts of resource management of
> temp files within fd.c. I'd have to replicate almost all state in fd.c
> to make it all work, which seems like a big modularity violation.
>
> Does anyone have any suggestions on how to tackle this?

Hmm.  One approach might be like this:

1.  There is a shared refcount which is incremented when you open a
shared file and decremented if you optionally explicitly 'release' it.
(Not when you close it, because we can't allow code that may be run
during RESOURCE_RELEASE_AFTER_LOCKS to try to access the DSM segment
after it has been unmapped; more generally, creating destruction order
dependencies between different kinds of resource-manager-cleaned-up
objects seems like a bad idea.  Of course the close code still looks
after closing the vfds in the local backend.)

2.  If you want to hand the file over to some other process and exit,
you probably want to avoid race conditions or extra IPC burden.  To
achieve that you could 'pin' the file, so that it survives even while
not open in any backend.

3.  If the refcount reaches zero when you 'release' and the file isn't
'pinned', then you must delete the underlying files.

4.  When the DSM segment is detached, we spin through all associated
shared files that we're still 'attached' to (ie opened but didn't
release) and decrement the refcount.  If any shared file's refcount
reaches zero its files should be deleted, even if it was 'pinned'.

In other words, the associated DSM segment's lifetime is the maximum
lifetime of shared files, but it can be shorter if you 'release' in
all backends and don't 'pin'.  It's up to client code can come up with
some scheme to make that work, if it doesn't take the easy route of
pinning until DSM segment destruction.
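
As an interface, the above might look something like this (names
invented purely for illustration):

/* Open by slot; increments the shared refcount. */
extern BufFile *shared_buf_file_open(dsm_segment *seg, int slot);

/* Optional: decrement the refcount; delete files at zero unless pinned. */
extern void shared_buf_file_release(dsm_segment *seg, int slot);

/* Keep the files alive even at refcount zero (ownership hand-over). */
extern void shared_buf_file_pin(dsm_segment *seg, int slot);

/* Undo a pin; delete the files if the refcount is already zero. */
extern void shared_buf_file_unpin(dsm_segment *seg, int slot);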

I think in your case you'd simply pin all the BufFiles allowing
workers to exit when they're done; the leader would wait for all
workers to indicate they'd finished, and then open the files.  The
files would be deleted eventually when the last process detaches from
the DSM segment (very likely the leader).

In my case I'd pin all shared BufFiles and then release them when I'd
finished reading them back in and didn't need them anymore, and unpin
them in the first participant to discover that the end had been
reached (it would be a programming error to pin twice or unpin twice,
like similarly named operations for DSM segments and DSA areas).
That'd preserve the existing Hash Join behaviour of deleting batch
files as soon as possible, but also guarantee cleanup in any error
case.

There is something a bit unpleasant about teaching other subsystems
about the existence of DSM segments just to be able to use DSM
lifetime as a cleanup scope.  I do think dsm_on_detach is a pretty
good place to do cleanup of resources in parallel computing cases like
ours, but I wonder if we could introduce a more generic destructor
callback interface which DSM segments could provide.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Thomas Munro
Date:
On Wed, Feb 8, 2017 at 8:40 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Tue, Feb 7, 2017 at 5:43 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>> Does anyone have any suggestions on how to tackle this?
>
> Hmm.  One approach might be like this:
>
> [hand-wavy stuff]

Thinking a bit harder about this, I suppose there could be a kind of
object called a SharedBufFileManager (insert better name) which you
can store in a DSM segment.  The leader backend that initialises a DSM
segment containing one of these would then call a constructor function
that sets an internal refcount to 1 and registers an on_dsm_detach
callback for its on-detach function.  All worker backends that attach
to the DSM segment would need to call an attach function for the
SharedBufFileManager to increment a refcount and also register the
on_dsm_detach callback, before any chance that an error might be
thrown (is that difficult?); failure to do so could result in file
leaks.  Then, when a BufFile is to be shared (AKA exported, made
unifiable), a SharedBufFile object can be initialised somewhere in the
same DSM segment and registered with the SharedBufFileManager.
Internally all registered SharedBufFile objects would be linked
together using offsets from the start of the DSM segment for link
pointers.  Now when SharedBufFileManager's on-detach function runs, it
decrements the refcount in the SharedBufFileManager, and if that
reaches zero then it runs a destructor that spins through the list of
SharedBufFile objects deleting files that haven't already been deleted
explicitly.
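
The on-detach function might look roughly like this (a sketch; the
types and the delete helper are invented):

static void
shared_buf_file_manager_on_detach(dsm_segment *seg, Datum arg)
{
    SharedBufFileManager *mgr = (SharedBufFileManager *) DatumGetPointer(arg);
    bool        last;
    size_t      off;

    SpinLockAcquire(&mgr->mutex);
    Assert(mgr->refcount > 0);
    last = (--mgr->refcount == 0);
    SpinLockRelease(&mgr->mutex);

    if (!last)
        return;

    /* Last to detach: walk the offset-linked list and clean up. */
    for (off = mgr->first_file; off != 0;)
    {
        SharedBufFile *f = (SharedBufFile *)
            ((char *) dsm_segment_address(seg) + off);

        if (!f->deleted)
            shared_buf_file_delete_files(f);    /* invented helper */
        off = f->next;
    }
}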

I retract the pin/unpin and per-file refcounting stuff I mentioned
earlier.  You could make the default that all files registered with a
SharedBufFileManager survive until the containing DSM segment is
detached everywhere using that single refcount in the
SharedBufFileManager object, but also provide a 'no really delete this
particular shared file now' operation for client code that knows it's
safe to do that sooner (which would be the case for me, I think).  I
don't think per-file refcounts are needed.

There are a couple of problems with the above though.  Firstly, doing
reference counting in DSM segment on-detach hooks is really a way to
figure out when the DSM segment is about to be destroyed by keeping a
separate refcount in sync with the DSM segment's refcount, but it
doesn't account for pinned DSM segments.  It's not your use-case or
mine currently, but someone might want a DSM segment to live even when
it's not attached anywhere, to be reattached later.  If we're trying
to use DSM segment lifetime as a scope, we'd be ignoring this detail.
Perhaps instead of adding our own refcount we need a new kind of hook
on_dsm_destroy.  Secondly, I might not want to be constrained by a
fixed-sized DSM segment to hold my SharedBufFile objects... there are
cases where I need to share a number of batch files that is unknown
at the start of execution time when the DSM segment is sized (I'll
write about that shortly on the Parallel Shared Hash thread).  Maybe I
can find a way to get rid of that requirement.  Or maybe it could
support DSA memory too, but I don't think it's possible to use
on_dsm_detach-based cleanup routines that refer to DSA memory because
by the time any given DSM segment's detach hook runs, there's no
telling which other DSM segments have been detached already, so the
DSA area may already have partially vanished; some other kind of hook
that runs earlier would be needed...

Hmm.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Robert Haas
Date:
On Wed, Feb 8, 2017 at 5:36 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> Thinking a bit harder about this, I suppose there could be a kind of
> object called a SharedBufFileManager (insert better name) which you
> can store in a DSM segment.  The leader backend that initialises a DSM
> segment containing one of these would then call a constructor function
> that sets an internal refcount to 1 and registers an on_dsm_detach
> callback for its on-detach function.  All worker backends that attach
> to the DSM segment would need to call an attach function for the
> SharedBufFileManager to increment a refcount and also register the
> on_dsm_detach callback, before any chance that an error might be
> thrown (is that difficult?); failure to do so could result in file
> leaks.  Then, when a BufFile is to be shared (AKA exported, made
> unifiable), a SharedBufFile object can be initialised somewhere in the
> same DSM segment and registered with the SharedBufFileManager.
> Internally all registered SharedBufFile objects would be linked
> together using offsets from the start of the DSM segment for link
> pointers.  Now when SharedBufFileManager's on-detach function runs, it
> decrements the refcount in the SharedBufFileManager, and if that
> reaches zero then it runs a destructor that spins through the list of
> SharedBufFile objects deleting files that haven't already been deleted
> explicitly.

I think this is approximately reasonable, but I think it could be made
simpler by having fewer separate objects.  Let's assume the leader can
put an upper bound on the number of shared BufFiles at the time it's
sizing the DSM segment (i.e. before InitializeParallelDSM).  Then it
can allocate a big ol' array with a header indicating the array size
and each element containing enough space to identify the relevant
details of 1 shared BufFile.  Now you don't need to do any allocations
later on, and you don't need a linked list.  You just loop over the
array and do what needs doing.
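
Presumably something along these lines (illustrative only; none of
these names exist in the patch):

/* Sized before InitializeParallelDSM(), so no later allocation needed. */
typedef struct SharedBufFileSlot
{
    bool        in_use;
    bool        deleted;
    pid_t       creator_pid;    /* enough to reconstruct the file names */
} SharedBufFileSlot;

typedef struct SharedBufFileSet
{
    slock_t     mutex;
    int         refcount;
    int         nslots;         /* fixed upper bound on shared BufFiles */
    SharedBufFileSlot slots[FLEXIBLE_ARRAY_MEMBER];
} SharedBufFileSet;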

> There are a couple of problems with the above though.  Firstly, doing
> reference counting in DSM segment on-detach hooks is really a way to
> figure out when the DSM segment is about to be destroyed by keeping a
> separate refcount in sync with the DSM segment's refcount, but it
> doesn't account for pinned DSM segments.  It's not your use-case or
> mine currently, but someone might want a DSM segment to live even when
> it's not attached anywhere, to be reattached later.  If we're trying
> to use DSM segment lifetime as a scope, we'd be ignoring this detail.
> Perhaps instead of adding our own refcount we need a new kind of hook
> on_dsm_destroy.

I think it's good enough to plan for current needs now.  It's not
impossible to change this stuff later, but we need something that
works robustly right now without being too invasive.  Inventing whole
new system concepts because of stuff we might someday want to do isn't
a good idea because we may easily guess wrong about what direction
we'll want to go in the future.  This is more like building a wrench
than a 747: a 747 needs to be extensible and reconfigurable and
upgradable because it costs $350 million.   A wrench costs $10 at
Walmart and if it turns out we bought the wrong one, we can just throw
it out and get a different one later.

> Secondly, I might not want to be constrained by a
> fixed-sized DSM segment to hold my SharedBufFile objects... there are
> cases where I need to share a number of batch files that is unknown
> at the start of execution time when the DSM segment is sized (I'll
> write about that shortly on the Parallel Shared Hash thread).  Maybe I
> can find a way to get rid of that requirement.  Or maybe it could
> support DSA memory too, but I don't think it's possible to use
> on_dsm_detach-based cleanup routines that refer to DSA memory because
> by the time any given DSM segment's detach hook runs, there's no
> telling which other DSM segments have been detached already, so the
> DSA area may already have partially vanished; some other kind of hook
> that runs earlier would be needed...

Again, wrench.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Thomas Munro
Date:
On Fri, Feb 10, 2017 at 9:51 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Feb 8, 2017 at 5:36 AM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> Thinking a bit harder about this, I suppose there could be a kind of
>> object called a SharedBufFileManager [... description of that ...].
>
> I think this is approximately reasonable, but I think it could be made
> simpler by having fewer separate objects.  Let's assume the leader can
> put an upper bound on the number of shared BufFiles at the time it's
> sizing the DSM segment (i.e. before InitializeParallelDSM).  Then it
> can allocate a big ol' array with a header indicating the array size
> and each element containing enough space to identify the relevant
> details of 1 shared BufFile.  Now you don't need to do any allocations
> later on, and you don't need a linked list.  You just loop over the
> array and do what needs doing.

Makes sense.

>> There are a couple of problems with the above though.  Firstly, doing
>> reference counting in DSM segment on-detach hooks is really a way to
>> figure out when the DSM segment is about to be destroyed by keeping a
>> separate refcount in sync with the DSM segment's refcount, but it
>> doesn't account for pinned DSM segments.  It's not your use-case or
>> mine currently, but someone might want a DSM segment to live even when
>> it's not attached anywhere, to be reattached later.  If we're trying
>> to use DSM segment lifetime as a scope, we'd be ignoring this detail.
>> Perhaps instead of adding our own refcount we need a new kind of hook
>> on_dsm_destroy.
>
> I think it's good enough to plan for current needs now.  It's not
> impossible to change this stuff later, but we need something that
> works robustly right now without being too invasive.  Inventing whole
> new system concepts because of stuff we might someday want to do isn't
> a good idea because we may easily guess wrong about what direction
> we'll want to go in the future.  This is more like building a wrench
> than a 747: a 747 needs to be extensible and reconfigurable and
> upgradable because it costs $350 million.   A wrench costs $10 at
> Walmart and if it turns out we bought the wrong one, we can just throw
> it out and get a different one later.

I agree that the pinned segment case doesn't matter right now, I just
wanted to point it out.  I like your $10 wrench analogy, but maybe it
could be argued that adding a dsm_on_destroy() callback mechanism is
not only better than adding another refcount to track that other
refcount, but also a steal at only $8.

>> Secondly, I might not want to be constrained by a
>> fixed-sized DSM segment to hold my SharedBufFile objects... there are
>> cases where I need to share a number of batch files that is unknown
>> at the start of execution time when the DSM segment is sized (I'll
>> write about that shortly on the Parallel Shared Hash thread).  Maybe I
>> can find a way to get rid of that requirement.  Or maybe it could
>> support DSA memory too, but I don't think it's possible to use
>> on_dsm_detach-based cleanup routines that refer to DSA memory because
>> by the time any given DSM segment's detach hook runs, there's no
>> telling which other DSM segments have been detached already, so the
>> DSA area may already have partially vanished; some other kind of hook
>> that runs earlier would be needed...
>
> Again, wrench.

My problem here is that I don't know how many batches I'll finish up
creating.  In general that's OK because I can hold onto them as
private BufFiles owned by participants with the existing cleanup
mechanism, and then share them just before they need to be shared (ie
when we switch to processing the next batch so they need to be
readable by all).  Now I only ever share one inner and one outer batch
file per participant at a time, and then I explicitly delete them at a
time that I know to be safe and before I need to share a new file that
would involve recycling the slot, and I'm relying on DSM segment scope
cleanup only to handle error paths.  That means that in general I
only need space for 2 * P shared BufFiles at a time.  But there is a
problem case: when the leader needs to exit early, it needs to be able
to transfer ownership of any files it has created, which could be more
than we planned for, and then not participate any further in the hash
join, so it can't participate in the on-demand sharing scheme.

Perhaps we can find a way to describe a variable number of BufFiles
(ie batches) in a fixed space by making sure the filenames are
constructed in a way that lets us just have to say how many there are.
Then the next problem is that for each BufFile we have to know how
many 1GB segments there are to unlink (files named foo, foo.1, foo.2,
...), which Peter's code currently captures by publishing the file
size in the descriptor... but if a fixed size object must describe N
BufFiles, where can I put the size of each one?  Maybe I could put it
in a header of the file itself (yuck!), or maybe I could decide that I
don't care what the size is, I'll simply unlink "foo", then "foo.1",
then "foo.2", ... until I get ENOENT.

Alternatively I might get rid of the requirement for the leader to
drop out of processing later batches.  I'm about to post a message to
the other thread about how to do that, but it's complicated and I'm
currently working on the assumption that the PSH patch is useful
without it (but let's not discuss that in this thread).  That would
have the side effect of getting rid of the requirement to share a
number of BufFiles that isn't known up front.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Robert Haas
Date:
On Thu, Feb 9, 2017 at 5:09 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> I agree that the pinned segment case doesn't matter right now, I just
> wanted to point it out.  I like your $10 wrench analogy, but maybe it
> could be argued that adding a dsm_on_destroy() callback mechanism is
> not only better than adding another refcount to track that other
> refcount, but also a steal at only $8.

If it's that simple, it might be worth doing, but I bet it's not.  One
problem is that there's a race condition: there will inevitably be a
period of time after you've called dsm_attach() and before you've
attached to the specific data structure that we're talking about here.
So suppose the last guy who actually knows about this data structure
dies horribly and doesn't clean up because the DSM isn't being
destroyed; moments later, you die horribly before reaching the code
where you attach to this data structure.  Oops.

You might think about plugging that hole by moving the registry of
on-destroy functions into the segment itself and making it a shared
resource.  But ASLR breaks that, especially for loadable modules.  You
could try to fix that problem, in turn, by storing arguments that can
later be passed to load_external_function() instead of a function
pointer per se.  But that sounds pretty fragile because some other
backend might not try to load the module until after it's attached the
DSM segment and it might then fail because loading the module runs
_PG_init() which can throw errors.   Maybe you can think of a way to
plug that hole too but you're waaaaay over your $8 budget by this
point.

>>> Secondly, I might not want to be constrained by a
>>> fixed-sized DSM segment to hold my SharedBufFile objects... there are
>>> cases where I need to share a number of batch files that is unknown
>>> at the start of execution time when the DSM segment is sized (I'll
>>> write about that shortly on the Parallel Shared Hash thread).  Maybe I
>>> can find a way to get rid of that requirement.  Or maybe it could
>>> support DSA memory too, but I don't think it's possible to use
>>> on_dsm_detach-based cleanup routines that refer to DSA memory because
>>> by the time any given DSM segment's detach hook runs, there's no
>>> telling which other DSM segments have been detached already, so the
>>> DSA area may already have partially vanished; some other kind of hook
>>> that runs earlier would be needed...
>>
>> Again, wrench.
>
> My problem here is that I don't know how many batches I'll finish up
> creating.  In general that's OK because I can hold onto them as
> private BufFiles owned by participants with the existing cleanup
> mechanism, and then share them just before they need to be shared (ie
> when we switch to processing the next batch so they need to be
> readable by all).  Now I only ever share one inner and one outer batch
> file per participant at a time, and then I explicitly delete them at a
> time that I know to be safe and before I need to share a new file that
> would involve recycling the slot, and I'm relying on DSM segment scope
> cleanup only to handle error paths.  That means that in general I
> only need space for 2 * P shared BufFiles at a time.  But there is a
> problem case: when the leader needs to exit early, it needs to be able
> to transfer ownership of any files it has created, which could be more
> than we planned for, and then not participate any further in the hash
> join, so it can't participate in the on-demand sharing scheme.

I thought the idea was that the structure we're talking about here
owns all the files, up to 2 from a leader that wandered off plus up to
2 for each worker.  Last process standing removes them.  Or are you
saying each worker only needs 2 files but the leader needs a
potentially unbounded number?

> Perhaps we can find a way to describe a variable number of BufFiles
> (ie batches) in a fixed space by making sure the filenames are
> constructed in a way that lets us just have to say how many there are.

That could be done.

> Then the next problem is that for each BufFile we have to know how
> many 1GB segments there are to unlink (files named foo, foo.1, foo.2,
> ...), which Peter's code currently captures by publishing the file
> size in the descriptor... but if a fixed size object must describe N
> BufFiles, where can I put the size of each one?  Maybe I could put it
> in a header of the file itself (yuck!), or maybe I could decide that I
> don't care what the size is, I'll simply unlink "foo", then "foo.1",
> then "foo.2", ... until I get ENOENT.

There's nothing wrong with that algorithm as far as I'm concerned.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Thomas Munro
Date:
On Fri, Feb 10, 2017 at 11:31 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Feb 9, 2017 at 5:09 PM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> I agree that the pinned segment case doesn't matter right now, I just
>> wanted to point it out.  I like your $10 wrench analogy, but maybe it
>> could be argued that adding a dsm_on_destroy() callback mechanism is
>> not only better than adding another refcount to track that other
>> refcount, but also a steal at only $8.
>
> If it's that simple, it might be worth doing, but I bet it's not.  One
> problem is that there's a race condition: there will inevitably be a
> period of time after you've called dsm_attach() and before you've
> attached to the specific data structure that we're talking about here.
> So suppose the last guy who actually knows about this data structure
> dies horribly and doesn't clean up because the DSM isn't being
> destroyed; moments later, you die horribly before reaching the code
> where you attach to this data structure.  Oops.

Right, I mentioned this problem earlier ("and also register the
on_dsm_detach callback, before any chance that an error might be
thrown (is that difficult?); failure to do so could result in file
leaks").

Here's my thought process... please tell me where I'm going wrong:

I have been assuming that it's not enough to just deal with this when
the leader detaches on the theory that other participants will always
detach first: that probably isn't true in some error cases, and could
contribute to spurious racy errors where other workers complain about
disappearing files if the leader somehow shuts down and cleans up
while a worker is still running.  Therefore we need *some* kind of
refcounting, whether it's a new kind or a new mechanism based on the
existing kind.

I have also been assuming that we don't want to teach dsm.c directly
about this stuff; it shouldn't need to know about other modules, so we
don't want it talking to buffile.c directly and managing a special
table of files; instead we want a system of callbacks.  Therefore
client code needs to do something after attaching to the segment in
each backend.

It doesn't matter whether we use an on_dsm_detach() callback and
manage our own refcount to infer that destruction is imminent, or a
new on_dsm_destroy() callback which tells us so explicitly: both ways
we'll need to make sure that anyone who attaches to the segment also
"attaches" to this shared BufFile manager object inside it, because
any backend might turn out to be the one that is last to detach.

That brings us to the race you mentioned.  Isn't it sufficient to say
that you aren't allowed to do anything that might throw in between
attaching to the segment and attaching to the SharedBufFileManager
that it contains?
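
In code terms, the policy amounts to something like this (a sketch;
the TOC key and the detach callback are invented names):

static SharedBufFileManager *
attach_shared_buf_files(dsm_segment *seg, shm_toc *toc)
{
    SharedBufFileManager *mgr;

    /* Nothing that can throw between here and the hook registration. */
    mgr = shm_toc_lookup(toc, KEY_SHARED_BUF_FILE_MANAGER);

    SpinLockAcquire(&mgr->mutex);
    mgr->refcount++;
    SpinLockRelease(&mgr->mutex);

    /*
     * on_dsm_detach() can allocate, so in principle it could still fail
     * here (OOM); the failure mode is then a leaked file rather than a
     * double unlink, which seems like the right way around.
     */
    on_dsm_detach(seg, shared_buf_file_manager_on_detach,
                  PointerGetDatum(mgr));
    return mgr;
}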

Up until two minutes ago I assumed that policy would leave only two
possibilities: you attach to the DSM segment and attach to the
SharedBufFileManager successfully or you attach to the DSM segment and
then die horribly (but not throw) and the postmaster restarts the
whole cluster and blows all temp files away with RemovePgTempFiles().
But I see now in the comment of that function that crash-induced
restarts don't call that because "someone might want to examine the
temp files for debugging purposes".  Given that policy for regular
private BufFiles, I don't see why that shouldn't apply equally to
shared files: after a crash restart, you may have some junk files that
won't be cleaned up until your next clean restart, whether they were
private or shared BufFiles.

> You might think about plugging that hole by moving the registry of
> on-destroy functions into the segment itself and making it a shared
> resource.  But ASLR breaks that, especially for loadable modules.  You
> could try to fix that problem, in turn, by storing arguments that can
> later be passed to load_external_function() instead of a function
> pointer per se.  But that sounds pretty fragile because some other
> backend might not try to load the module until after it's attached the
> DSM segment and it might then fail because loading the module runs
> _PG_init() which can throw errors.   Maybe you can think of a way to
> plug that hole too but you're waaaaay over your $8 budget by this
> point.

Agreed, those approaches seem like non-starters.

>> My problem here is that I don't know how many batches I'll finish up
>> creating.  [...]
>
> I thought the idea was that the structure we're talking about here
> owns all the files, up to 2 from a leader that wandered off plus up to
> 2 for each worker.  Last process standing removes them.  Or are you
> saying each worker only needs 2 files but the leader needs a
> potentially unbounded number?

Yes, potentially unbounded in rare cases.  If we plan for N batches,
and then run out of work_mem because our estimates were just wrong or
the distribution of keys is sufficiently skewed, we'll run
HashIncreaseNumBatches, and that could happen more than once.  I have
a suite of contrived test queries that hits all the various modes and
code paths of hash join, and it includes a query that plans for one
batch but finishes up creating many, and then the leader exits.  I'll
post that to the other thread along with my latest patch series soon.

>> Perhaps we can find a way to describe a variable number of BufFiles
>> (ie batches) in a fixed space by making sure the filenames are
>> constructed in a way that lets us just have to say how many there are.
>
> That could be done.

Cool.

>> Then the next problem is that for each BufFile we have to know how
>> many 1GB segments there are to unlink (files named foo, foo.1, foo.2,
>> ...), which Peter's code currently captures by publishing the file
>> size in the descriptor... but if a fixed size object must describe N
>> BufFiles, where can I put the size of each one?  Maybe I could put it
>> in a header of the file itself (yuck!), or maybe I could decide that I
>> don't care what the size is, I'll simply unlink "foo", then "foo.1",
>> then "foo.2", ... until I get ENOENT.
>
> There's nothing wrong with that algorithm as far as I'm concerned.

Cool.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Thu, Feb 9, 2017 at 2:31 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> You might think about plugging that hole by moving the registry of
> on-destroy functions into the segment itself and making it a shared
> resource.  But ASLR breaks that, especially for loadable modules.  You
> could try to fix that problem, in turn, by storing arguments that can
> later be passed to load_external_function() instead of a function
> pointer per se.  But that sounds pretty fragile because some other
> backend might not try to load the module until after it's attached the
> DSM segment and it might then fail because loading the module runs
> _PG_init() which can throw errors.   Maybe you can think of a way to
> plug that hole too but you're waaaaay over your $8 budget by this
> point.

At the risk of stating the obvious, ISTM that the right way to do
this, at a high level, is to err on the side of unneeded extra
unlink() calls, not leaking files -- and to make the window for
problems (the "remaining hole that you haven't quite managed to plug")
practically indistinguishable from no hole at all, in a way that's
kind of baked into the API.

It's not like we currently throw an error when there is a problem with
deleting temp files that are no longer needed on resource manager
cleanup. We simply log the fact that it happened, and limp on.

I attach my V8. This does not yet do anything with on_dsm_detach().
I've run out of time to work on it this week, and am starting a new
job next week at VMware, which I'll need time to settle into. So I'm
posting this now, since you can still very much see the direction I'm
going in, and can give me any feedback that you have. If anyone wants
to show me how it's done by building on this, and finishing what I have
off, be my guest. The new stuff probably isn't quite as polished as I
would prefer, but time grows short, so I won't withhold it.

Changes:

* Implements the refcount thing, albeit in a way that leaves a small
window for double unlink() calls if an error occurs while there is
worker/leader co-ownership of a BufFile (add an "elog(ERROR)" just
before the leader-as-worker Tuplesort state is ended within
_bt_leafbuild() to see what I mean; see the sketch after this list).
This implies that background workers can be reclaimed once the leader
needs to start its final on-the-fly merge, which is nice. As an
example of how that's nice, this change makes maintenance_work_mem a
budget that we more strictly adhere to.

* Fixes bitrot caused by recent logtape.c bugfix in master branch.

* No local segment is created during unification unless and until one
is required. (In practice, for current use of BufFile infrastructure,
no "local" segment is ever created, even if we force a randomAccess
case using one of the testing GUCs from 0002-* -- we'd have to use
another GUC to *also* force there to be no reclamation.)

* Better testing. As I just mentioned, we can now force logtape.c not
to reclaim blocks, so that new local segments are created as part of a
unified BufFile; these have different considerations from a resource
management point of view. Despite being part of the same "unified"
BufFile from the leader's perspective, such a segment behaves like a
local one, so it definitely seems like a good idea to have test coverage
for this, at least during development. (I have a pretty rough test
suite that I'm using; development of this patch has been somewhat test
driven.)

* Better encapsulation of BufFile stuff. I am even closer to the ideal
of this whole sharing mechanism being a fairly generic BufFile thing
that logtape.c piggy-backs on without having special knowledge of the
mechanism. It's still true that the mechanism (sharing/unification) is
written principally with logtape.c in mind, but that's just because of
its performance characteristics. Nothing to do with the interface.

* Worked through items raised by Thomas in his 2017-01-30 mail to this thread.
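
For instance, the fault injection mentioned in the first item can be
as simple as this, placed in _bt_leafbuild() just before the
leader-as-worker Tuplesort state is ended (the guard macro is invented
and test-only):

#ifdef FAULT_INJECT_BUFFILE_COOWNERSHIP
    /* Error out while worker and leader co-own the BufFile, to
     * exercise the window in which a double unlink() is possible. */
    elog(ERROR, "injected error during BufFile co-ownership");
#endif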

>>>> Secondly, I might not want to be constrained by a
>>>> fixed-sized DSM segment to hold my SharedBufFile objects... there are
>>>> cases where I need to share a number of batch files that is unknown
>>>> at the start of execution time when the DSM segment is sized (I'll
>>>> write about that shortly on the Parallel Shared Hash thread).  Maybe I
>>>> can find a way to get rid of that requirement.  Or maybe it could
>>>> support DSA memory too, but I don't think it's possible to use
>>>> on_dsm_detach-based cleanup routines that refer to DSA memory because
>>>> by the time any given DSM segment's detach hook runs, there's no
>>>> telling which other DSM segments have been detached already, so the
>>>> DSA area may already have partially vanished; some other kind of hook
>>>> that runs earlier would be needed...
>>>
>>> Again, wrench.

I like the wrench analogy too, FWIW.

>> My problem here is that I don't know how many batches I'll finish up
>> creating.  In general that's OK because I can hold onto them as
>> private BufFiles owned by participants with the existing cleanup
>> mechanism, and then share them just before they need to be shared (ie
>> when we switch to processing the next batch so they need to be
>> readable by all).  Now I only ever share one inner and one outer batch
>> file per participant at a time, and then I explicitly delete them at a
>> time that I know to be safe and before I need to share a new file that
>> would involve recycling the slot, and I'm relying on DSM segment scope
>> cleanup only to handle error paths.  That means that in general I
>> only need space for 2 * P shared BufFiles at a time.  But there is a
>> problem case: when the leader needs to exit early, it needs to be able
>> to transfer ownership of any files it has created, which could be more
>> than we planned for, and then not participate any further in the hash
>> join, so it can't participate in the on-demand sharing scheme.

I think that parallel CREATE INDEX can easily live with the
restriction that we need to know how many shared BufFiles are needed
up front. It will either be 1, or 2 (when there are 2 nbtsort.c
spools, for unique index builds). We can also detect when the limit is
already exceeded early, and back out, just as we do when there are no
parallel workers currently available.

>> Then the next problem is that for each BufFile we have to know how
>> many 1GB segments there are to unlink (files named foo, foo.1, foo.2,
>> ...), which Peter's code currently captures by publishing the file
>> size in the descriptor... but if a fixed size object must describe N
>> BufFiles, where can I put the size of each one?  Maybe I could put it
>> in a header of the file itself (yuck!), or maybe I could decide that I
>> don't care what the size is, I'll simply unlink "foo", then "foo.1",
>> then "foo.2", ... until I get ENOENT.
>
> There's nothing wrong with that algorithm as far as I'm concerned.
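
For concreteness, that loop would look something like this (a sketch
only; the helper and path handling are hypothetical, not actual
buffile.c code):

#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical sketch: remove every segment of a BufFile (foo, foo.1,
 * foo.2, ...) without knowing the file's size up front, stopping at
 * the first missing segment.  Real fd.c-style code would LOG failures
 * other than ENOENT rather than silently ignoring them.
 */
static void
remove_segments_until_enoent(const char *base)
{
    char        path[1024];
    int         segno;

    for (segno = 0;; segno++)
    {
        if (segno == 0)
            snprintf(path, sizeof(path), "%s", base);
        else
            snprintf(path, sizeof(path), "%s.%d", base, segno);

        if (unlink(path) < 0)
            break;              /* ENOENT: no more segments */
    }
}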

I would like to point out, just to be completely clear, that while
this V8 doesn't "do refcounts properly" (it doesn't use an
on_dsm_detach() hook and so on), the only benefit that doing so would
actually have for parallel CREATE INDEX is that it makes it impossible
for the user to see a spurious ENOENT-related log message during
unlink() (I err on the side of doing too much unlinking, not too
little). That is very unlikely anyway. So, if that's okay for
parallel hash join, as indicated by Robert here, an issue like that
would presumably also be okay for parallel CREATE INDEX. It then
follows that what I'm missing here is something that is only really
needed for the parallel hash join patch anyway.

I really want to help Thomas, and am not shirking what I feel is a
responsibility to assist him. I have every intention of breaking this
down to produce a usable patch that only has the BufFile + resource
management stuff, which follows the interface he sketched as a
requirement for me in his most recent revision of his patch series
("0009-hj-shared-buffile-strawman-v4.patch"). I'm just pointing out
that my patch is reasonably complete as a standalone piece of work
right now, AFAICT.

-- 
Peter Geoghegan



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Robert Haas
Date:
On Thu, Feb 9, 2017 at 6:38 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> Here's my thought process... please tell me where I'm going wrong:
>
> I have been assuming that it's not enough to just deal with this when
> the leader detaches on the theory that other participants will always
> detach first: that probably isn't true in some error cases, and could
> contribute to spurious racy errors where other workers complain about
> disappearing files if the leader somehow shuts down and cleans up
> while a worker is still running.  Therefore we need *some* kind of
> refcounting, whether it's a new kind or a new mechanism based on the
> existing kind.

+1.

> I have also been assuming that we don't want to teach dsm.c directly
> about this stuff; it shouldn't need to know about other modules, so we
> don't want it talking to buffile.c directly and managing a special
> table of files; instead we want a system of callbacks.  Therefore
> client code needs to do something after attaching to the segment in
> each backend.

+1.

> It doesn't matter whether we use an on_dsm_detach() callback and
> manage our own refcount to infer that destruction is imminent, or a
> new on_dsm_destroy() callback which tells us so explicitly: both ways
> we'll need to make sure that anyone who attaches to the segment also
> "attaches" to this shared BufFile manager object inside it, because
> any backend might turn out to be the one that is last to detach.

Not entirely.  In the first case, you don't need the requirement that
everyone who attaches to the segment must attach to the shared BufFile
manager.  In the second case, you do.
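
To illustrate the first case, here is a minimal sketch of a refcounted
on_dsm_detach() hook (the SharedBufFileManager struct is hypothetical,
but on_dsm_detach() and the spinlock primitives are the real APIs):

#include "postgres.h"
#include "storage/dsm.h"
#include "storage/spin.h"

/* Hypothetical shared state placed inside the DSM segment. */
typedef struct SharedBufFileManager
{
    slock_t     mutex;
    int         refcnt;         /* backends attached to the manager */
} SharedBufFileManager;

/*
 * on_dsm_detach() callback: the last backend to detach infers that
 * destruction is imminent and cleans up the shared temp files.
 */
static void
shared_buffile_on_detach(dsm_segment *seg, Datum arg)
{
    SharedBufFileManager *mgr = (SharedBufFileManager *) DatumGetPointer(arg);
    bool        last;

    SpinLockAcquire(&mgr->mutex);
    last = (--mgr->refcnt == 0);
    SpinLockRelease(&mgr->mutex);

    if (last)
    {
        /* unlink all of this manager's shared temp files here */
    }
}

/*
 * Each backend that opts in increments mgr->refcnt under the mutex and
 * then registers the hook after attaching:
 *
 *      on_dsm_detach(seg, shared_buffile_on_detach, PointerGetDatum(mgr));
 */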

> That bring us to the race you mentioned.  Isn't it sufficient to say
> that you aren't allowed to do anything that might throw in between
> attaching to the segment and attaching to the SharedBufFileManager
> that it contains?

That would be sufficient, but I think it's not a very good design.  It
means, for example, that nothing between the time you attach to the
segment and the time you attach to this manager can palloc() anything.
So, for example, it would have to happen before ParallelWorkerMain
reaches the call to shm_mq_attach, which kinda sucks because we want
to do that as soon as possible after attaching to the DSM segment so
that errors are reported properly thereafter.  Note that's the very
first thing we do now, except for working out what the arguments to
that call need to be.

Also, while it's currently safe to assume that shm_toc_attach() and
shm_toc_lookup() don't throw errors, I've thought about the
possibility of installing some sort of cache in shm_toc_lookup() to
amortize the cost of lookups, if the number of keys ever got too
large.  And that would then require a palloc().  Generally, backend
code should be free to throw errors.  When it's absolutely necessary
for a short segment of code to avoid that, then we do, but you can't
really rely on any substantial amount of code to be that way, or stay
that way.

And in this case, even if we didn't mind those problems or had some
solution to them, I think that the shared buffer manager shouldn't
have to be something that is whacked directly into parallel.c all the
way at the beginning of the initialization sequence so that nothing
can fail before it happens.  I think it should be an optional data
structure that clients of the parallel infrastructure can decide to
use, or to not use.  It should be at arm's length from the core code,
just like the way ParallelQueryMain() is distinct from
ParallelWorkerMain() and sets up its own set of data structures with
their own set of keys.  All that stuff is happy to happen after
whatever ParallelWorkerMain() feels that it needs to do, even if
ParallelWorkerMain might throw errors for any number of unknown
reasons.  Similarly, I think this new thing should be something that
an executor node can decide to create inside its own per-node space --
reserved via ExecParallelEstimate, initialized via
ExecParallelInitializeDSM, etc.  There's no need for it to be deeply
coupled to parallel.c itself unless we force that choice by sticking a
no-fail requirement in there.
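
Sketched out, the arm's-length pattern I have in mind looks something
like this (node names and key are hypothetical; shm_toc_* and
ParallelContext are the real interfaces, and a real executor node
would key its space by plan_node_id):

#include "postgres.h"
#include "access/parallel.h"
#include "storage/shm_toc.h"

/* Hypothetical per-node shared state. */
typedef struct MyNodeShared
{
    int         nparticipants;
} MyNodeShared;

#define MYNODE_KEY_SHARED   UINT64CONST(1)

/* Called from ExecParallelEstimate(): reserve per-node DSM space. */
static void
mynode_estimate(ParallelContext *pcxt)
{
    shm_toc_estimate_chunk(&pcxt->estimator, sizeof(MyNodeShared));
    shm_toc_estimate_keys(&pcxt->estimator, 1);
}

/* Called from ExecParallelInitializeDSM(): set up the shared state. */
static MyNodeShared *
mynode_initialize_dsm(ParallelContext *pcxt)
{
    MyNodeShared *shared;

    shared = shm_toc_allocate(pcxt->toc, sizeof(MyNodeShared));
    shared->nparticipants = pcxt->nworkers + 1;
    shm_toc_insert(pcxt->toc, MYNODE_KEY_SHARED, shared);
    return shared;
}

Nothing in parallel.c needs to know that any of this exists.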

> Up until two minutes ago I assumed that policy would leave only two
> possibilities: you attach to the DSM segment and attach to the
> SharedBufFileManager successfully or you attach to the DSM segment and
> then die horribly (but not throw) and the postmaster restarts the
> whole cluster and blows all temp files away with RemovePgTempFiles().
> But I see now in the comment of that function that crash-induced
> restarts don't call that because "someone might want to examine the
> temp files for debugging purposes".  Given that policy for regular
> private BufFiles, I don't see why that shouldn't apply equally to
> shared files: after a crash restart, you may have some junk files that
> won't be cleaned up until your next clean restart, whether they were
> private or shared BufFiles.

I think most people (other than Tom) would agree that that policy
isn't really sensible any more; it probably made sense when the
PostgreSQL user community was much smaller and consisted mostly of the
people developing PostgreSQL, but these days it's much more likely to
cause operational headaches than to help a developer debug.
Regardless, I think the primary danger isn't failure to remove a file
(although that is best avoided) but removing one too soon (causing
someone else to error when opening it, or on Windows causing the
delete itself to error out).  It's not really OK for random stuff to
throw errors in corner cases because we were too lazy to ensure that
cleanup operations happen in the right order.

>> I thought the idea was that the structure we're talking about here
>> owns all the files, up to 2 from a leader that wandered off plus up to
>> 2 for each worker.  Last process standing removes them.  Or are you
>> saying each worker only needs 2 files but the leader needs a
>> potentially unbounded number?
>
> Yes, potentially unbounded in rare case.  If we plan for N batches,
> and then run out of work_mem because our estimates were just wrong or
> the distribution of keys is sufficiently skewed, we'll run
> HashIncreaseNumBatches, and that could happen more than once.  I have
> a suite of contrived test queries that hits all the various modes and
> code paths of hash join, and it includes a query that plans for one
> batch but finishes up creating many, and then the leader exits.  I'll
> post that to the other thread along with my latest patch series soon.

Hmm, OK.  So that's going to probably require something where a fixed
amount of DSM can describe an arbitrary number of temp file series.
But that also means this is an even-more-special-purpose tool that
shouldn't be deeply tied into parallel.c so that it can run before any
errors happen.

Basically, I think the "let's write the code between here and here so
it throws no errors" technique is, for 99% of PostgreSQL programming,
difficult and fragile.  We shouldn't rely on it if there is some other
reasonable option.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Thomas Munro
Date:
On Sat, Feb 4, 2017 at 2:45 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> It might just have been that the table was too small to be an
> effective target for parallel sequential scan with so many workers,
> and so a presorted best case CREATE INDEX, which isn't that different,
> also fails to see much benefit (compared to what you'd see with a
> similar case involving a larger table). In other words, I might have
> jumped the gun in emphasizing issues with hardware and I/O bandwidth
> over issues around data volume (that I/O parallelism is inherently not
> very helpful with these relatively small tables).
>
> As I've pointed out a couple of times before, bigger sorts will be
> more CPU bound because sorting itself has costs that grow
> linearithmically, whereas writing out runs has costs that grow
> linearly. The relative cost of the I/O can be expected to go down as
> input goes up for this reason. At the same time, a larger input might
> make better use of I/O parallelism, which reduces the cost paid in
> latency to write out runs in absolute terms.

Here are some results with your latest patch, using the same test as
before but this time with SCALE=100 (= 100,000,000 rows).  The table
sizes are:

                             List of relations
 Schema |         Name         | Type  |    Owner     | Size  | Description
--------+----------------------+-------+--------------+-------+-------------
 public | million_words        | table | thomas.munro | 42 MB |
 public | some_words           | table | thomas.munro | 19 MB |
 public | test_intwide_u_asc   | table | thomas.munro | 18 GB |
 public | test_intwide_u_desc  | table | thomas.munro | 18 GB |
 public | test_intwide_u_rand  | table | thomas.munro | 18 GB |
 public | test_textwide_u_asc  | table | thomas.munro | 19 GB |
 public | test_textwide_u_desc | table | thomas.munro | 19 GB |
 public | test_textwide_u_rand | table | thomas.munro | 19 GB |

To reduce the number of combinations I did only unique data and built
only non-unique indexes with only 'wide' tuples (= key plus a text
column that holds a 151-character wide string, rather than just the
key), and also didn't bother with the 1MB memory size as suggested.
Here are the results up to 4 workers (a results table going up to 8
workers is attached, since it wouldn't format nicely if I pasted it
here).  Again, the w = 0 time is seconds, the rest show relative
speed-up.  This data was all in the OS page cache because of a dummy
run done first, and I verified with 'sar' that there was exactly 0
reading from the block device.  The CPU was pegged on leader + workers
during sort runs, and then the leader's CPU hovered around 93-98%
during the merge/btree build.  I had some technical problems getting a
cold-cache read-from-actual-disk-each-time test run to work properly,
but can go back and do that again if anyone thinks that would be
interesting data to see.

   tab    | ord  | mem |  w = 0  | w = 1 | w = 2 | w = 3 | w = 4
----------+------+-----+---------+-------+-------+-------+-------
 intwide  | asc  |  64 |   67.91 | 1.26x | 1.46x | 1.62x | 1.73x
 intwide  | asc  | 256 |   67.84 | 1.23x | 1.48x | 1.63x | 1.79x
 intwide  | asc  | 512 |   69.01 | 1.25x | 1.50x | 1.63x | 1.80x
 intwide  | desc |  64 |   98.08 | 1.48x | 1.83x | 2.03x | 2.25x
 intwide  | desc | 256 |   99.87 | 1.43x | 1.80x | 2.03x | 2.29x
 intwide  | desc | 512 |  104.09 | 1.44x | 1.85x | 2.09x | 2.33x
 intwide  | rand |  64 |  138.03 | 1.56x | 2.04x | 2.42x | 2.58x
 intwide  | rand | 256 |  139.44 | 1.61x | 2.04x | 2.38x | 2.56x
 intwide  | rand | 512 |  138.96 | 1.52x | 2.03x | 2.28x | 2.57x
 textwide | asc  |  64 |  207.10 | 1.20x | 1.07x | 1.09x | 1.11x
 textwide | asc  | 256 |  200.62 | 1.19x | 1.06x | 1.04x | 0.99x
 textwide | asc  | 512 |  191.42 | 1.16x | 0.97x | 1.01x | 0.94x
 textwide | desc |  64 | 1382.48 | 1.89x | 2.37x | 3.18x | 3.87x
 textwide | desc | 256 | 1427.99 | 1.89x | 2.42x | 3.24x | 4.00x
 textwide | desc | 512 | 1453.21 | 1.86x | 2.39x | 3.23x | 3.75x
 textwide | rand |  64 | 1587.28 | 1.89x | 2.37x | 2.66x | 2.75x
 textwide | rand | 256 | 1557.90 | 1.85x | 2.34x | 2.64x | 2.73x
 textwide | rand | 512 | 1547.97 | 1.87x | 2.32x | 2.64x | 2.71x

"textwide" "asc" is nearly an order of magnitude faster than other
initial orders without parallelism, but then parallelism doesn't seem
to help it much.  Also, using more than 64MB doesn't ever seem to help
very much; in the "desc" case it hinders.

I was curious to understand how performance changes if we become just
a bit less correlated (rather than completely uncorrelated or
perfectly inversely correlated), so I tried out a 'banana skin' case:
I took the contents of the textwide asc table and copied it to a new
table, and then moved the 900 words matching 'banana%' to the physical
end of the heap by deleting and reinserting them in one transaction.
I guess if we were to use this technology for CLUSTER, this might be
representative of a situation where you regularly recluster a growing
table.  The results were pretty much like "asc":

   tab    |  ord   | mem | w = 0  | w = 1 | w = 2 | w = 3 | w = 4
----------+--------+-----+--------+-------+-------+-------+-------
 textwide | banana |  64 | 213.39 | 1.17x | 1.11x | 1.15x | 1.09x

It's hard to speculate about this, but I guess that a significant
number of indexes in real world databases might be uncorrelated to
insert order.  A newly imported or insert-only table might have one
highly correlated index for a surrogate primary key or time column,
but other indexes might tend to be uncorrelated.  But really, who
knows...  in a kind of textbook perfectly correlated case such as a
time series table with an append-only time or sequence based key, you
might want to use BRIN rather than B-Tree anyway.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Robert Haas
Date:
On Thu, Feb 9, 2017 at 7:10 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> At the risk of stating the obvious, ISTM that the right way to do
> this, at a high level, is to err on the side of unneeded extra
> unlink() calls, not leaking files. And, to make the window for problem
> ("remaining hole that you haven't quite managed to plug") practically
> indistinguishable from no hole at all, in a way that's kind of baked
> into the API.

I do not think there should be any reason why we can't get the
resource accounting exactly correct here.  If a single backend manages
to remove every temporary file that it creates exactly once (and
that's currently true, modulo system crashes), a group of cooperating
backends ought to be able to manage to remove every temporary file
that any of them create exactly once (again, modulo system crashes).

I do agree that a duplicate unlink() call isn't as bad as a missing
unlink() call, at least if there's no possibility that the filename
could have been reused by some other process, or some other part of
our own process, which doesn't want that new file unlinked.  But it's
messy.  If the seatbelts in your car were to randomly unbuckle, that
would be a safety hazard.  If they were to randomly refuse to
unbuckle, you wouldn't say "that's OK because it's not a safety
hazard", you'd say "these seatbelts are badly designed".  And I think
the same is true of this mechanism.

The way to make this 100% reliable is to set things up so that there
is joint ownership from the beginning and shared state that lets you
know whether the work has already been done.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Thu, Feb 16, 2017 at 6:28 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Feb 9, 2017 at 7:10 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>> At the risk of stating the obvious, ISTM that the right way to do
>> this, at a high level, is to err on the side of unneeded extra
>> unlink() calls, not leaking files. And, to make the window for problem
>> ("remaining hole that you haven't quite managed to plug") practically
>> indistinguishable from no hole at all, in a way that's kind of baked
>> into the API.
>
> I do not think there should be any reason why we can't get the
> resource accounting exactly correct here.  If a single backend manages
> to remove every temporary file that it creates exactly once (and
> that's currently true, modulo system crashes), a group of cooperating
> backends ought to be able to manage to remove every temporary file
> that any of them create exactly once (again, modulo system crashes).

I believe that we are fully in agreement here. In particular, I think
it's bad that there is an API that says "caller shouldn't throw an
elog error between these two points", and that will be fixed before
too long. I just think that it's worth acknowledging a certain nuance.

> I do agree that a duplicate unlink() call isn't as bad as a missing
> unlink() call, at least if there's no possibility that the filename
> could have been reused by some other process, or some other part of
> our own process, which doesn't want that new file unlinked.  But it's
> messy.  If the seatbelts in your car were to randomly unbuckle, that
> would be a safety hazard.  If they were to randomly refuse to
> unbuckle, you wouldn't say "that's OK because it's not a safety
> hazard", you'd say "these seatbelts are badly designed".  And I think
> the same is true of this mechanism.

If it happened in the lifetime of only one out of a million seatbelts
manufactured, and they were manufactured at a competitive price (not
over-engineered), I probably wouldn't say that. The fact that the
existing resource manager code only LOGs most temp file related
failures suggests to me that that's a "can't happen" condition, but we
still hedge. I would still like to hedge against even (theoretically)
impossible risks.

Maybe I'm just being pedantic here, since we both actually want the
code to do the same thing.

-- 
Peter Geoghegan



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Wed, Feb 15, 2017 at 6:05 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> Here are some results with your latest patch, using the same test as
> before but this time with SCALE=100 (= 100,000,000 rows).

Cool.

> To reduce the number of combinations I did only unique data and built
> only non-unique indexes with only 'wide' tuples (= key plus a text
> column that holds a 151-character wide string, rather than just the
> key), and also didn't bother with the 1MB memory size as suggested.
> Here are the results up to 4 workers (a results table going up to 8
> workers is attached, since it wouldn't format nicely if I pasted it
> here).

I think that you are still I/O bound in a way that is addressable by
adding more disks. The exception is the text cases, where the patch
does best. (I don't place too much emphasis on that because I know
that in the long term, we'll have abbreviated keys, which will take
some of the sheen off of that.)

> Again, the w = 0 time is seconds, the rest show relative
> speed-up.

I think it's worth pointing out that while there are cases where we
see no benefit from going from 4 to 8 workers, it tends to hardly hurt
at all, or hardly help at all. It's almost irrelevant that the number
of workers used is excessive, at least up until the point when all
cores have their own worker. That's a nice quality for this to have --
the only danger is that we use parallelism when we shouldn't have at
all, because the serial case could have managed an internal sort, and
the sort was small enough for that to be a notable factor.

> "textwide" "asc" is nearly an order of magnitude faster than other
> initial orders without parallelism, but then parallelism doesn't seem
> to help it much.  Also, using more that 64MB doesn't ever seem to help
> very much; in the "desc" case it hinders.

Maybe it's CPU cache efficiency? There are edge cases where multiple
passes are faster than one pass. That's the only explanation I can
think of.

> I was curious to understand how performance changes if we become just
> a bit less correlated (rather than completely uncorrelated or
> perfectly inversely correlated), so I tried out a 'banana skin' case:
> I took the contents of the textwide asc table and copied it to a new
> table, and then moved the 900 words matching 'banana%' to the physical
> end of the heap by deleting and reinserting them in one transaction.

A likely problem with that is that most runs will actually not have
their own banana skin, so to speak. You only see a big drop in
performance when every quicksort operation has presorted input, but
with one or more out-of-order tuples at the end. In order to see a
really unfortunate case with parallel CREATE INDEX, you'd probably
have to have enough memory that workers don't need to do their own
merge (so that a worker's work consists almost entirely of one big
quicksort operation), with enough "banana skin heap pages" that the
parallel heap scan is pretty much guaranteed to end up giving "banana
skin" (out of order) tuples to every worker, making all of them "have
a slip" (throw away a huge amount of work as the presorted
optimization is defeated right at the end of its sequential read
through).

A better approach would be to have several small localized areas
across the input where tuples are a little out of order. That would
probably show that the performance is pretty much in line with the
random cases.

> It's hard to speculate about this, but I guess that a significant
> number of indexes in real world databases might be uncorrelated to
> insert order.

That would certainly be true with text, where we see a risk of (small)
regressions.

-- 
Peter Geoghegan



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Robert Haas
Date:
On Thu, Feb 16, 2017 at 11:45 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> Maybe I'm just being pedantic here, since we both actually want the
> code to do the same thing.

Pedantry from either of us?  Nah...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Thomas Munro
Date:
On Sat, Feb 11, 2017 at 1:52 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Feb 9, 2017 at 6:38 PM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> Yes, potentially unbounded in rare case.  If we plan for N batches,
>> and then run out of work_mem because our estimates were just wrong or
>> the distribution of keys is sufficiently skewed, we'll run
>> HashIncreaseNumBatches, and that could happen more than once.  I have
>> a suite of contrived test queries that hits all the various modes and
>> code paths of hash join, and it includes a query that plans for one
>> batch but finishes up creating many, and then the leader exits.  I'll
>> post that to the other thread along with my latest patch series soon.
>
> Hmm, OK.  So that's going to probably require something where a fixed
> amount of DSM can describe an arbitrary number of temp file series.
> But that also means this is an even-more-special-purpose tool that
> shouldn't be deeply tied into parallel.c so that it can run before any
> errors happen.
>
> Basically, I think the "let's write the code between here and here so
> it throws no errors" technique is, for 99% of PostgreSQL programming,
> difficult and fragile.  We shouldn't rely on it if there is some other
> reasonable option.

I'm testing a patch that lets you set up a fixed sized
SharedBufFileSet object in a DSM segment, with its own refcount for
the reason you explained.  It supports a dynamically expandable set of
numbered files, so each participant gets to export file 0, file 1,
file 2 and so on as required, in any order.  I think this should suit
both Parallel Tuplesort which needs to export just one file from each
participant, and Parallel Shared Hash which doesn't know up front how
many batches it will produce.  Not quite ready but I will post a
version tomorrow to get Peter's reaction.
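
The trick that makes a fixed-size object workable is deriving every
path deterministically from (set, participant, file number), so that
nothing per-file has to live in shared memory.  Roughly like this
(the struct and naming scheme are a sketch, not what the patch
actually does):

#include "postgres.h"
#include "storage/spin.h"

/* Hypothetical fixed-size descriptor placed in the DSM segment. */
typedef struct SharedBufFileSet
{
    pid_t       creator_pid;    /* disambiguates sets across backends */
    uint32      set_number;     /* per-creator counter */
    slock_t     mutex;
    int         refcnt;         /* last backend to detach cleans up */
} SharedBufFileSet;

/*
 * Any participant can open (or unlink) file N of participant P with
 * no per-file shared state at all.
 */
static void
shared_buf_file_path(char *path, size_t len, SharedBufFileSet *set,
                     int participant, int fileno)
{
    snprintf(path, len, "base/pgsql_tmp/pgsql_tmp%d.%u.%d.%d",
             (int) set->creator_pid, set->set_number,
             participant, fileno);
}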

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Thomas Munro
Date:
On Wed, Mar 1, 2017 at 10:29 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> I'm testing a patch that lets you set up a fixed sized
> SharedBufFileSet object in a DSM segment, with its own refcount for
> the reason you explained.  It supports a dynamically expandable set of
> numbered files, so each participant gets to export file 0, file 1,
> file 2 and so on as required, in any order.  I think this should suit
> both Parallel Tuplesort which needs to export just one file from each
> participant, and Parallel Shared Hash which doesn't know up front how
> many batches it will produce.  Not quite ready but I will post a
> version tomorrow to get Peter's reaction.

See 0007-hj-shared-buf-file-v6.patch in the v6 tarball in the parallel
shared hash thread.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Thu, Feb 16, 2017 at 8:45 AM, Peter Geoghegan <pg@bowt.ie> wrote:
>> I do not think there should be any reason why we can't get the
>> resource accounting exactly correct here.  If a single backend manages
>> to remove every temporary file that it creates exactly once (and
>> that's currently true, modulo system crashes), a group of cooperating
>> backends ought to be able to manage to remove every temporary file
>> that any of them create exactly once (again, modulo system crashes).
>
> I believe that we are fully in agreement here. In particular, I think
> it's bad that there is an API that says "caller shouldn't throw an
> elog error between these two points", and that will be fixed before
> too long. I just think that it's worth acknowledging a certain nuance.

I attach my V9 of the patch. I came up with some stuff for the design of
resource management that I think meets every design goal that we have
for shared/unified BufFiles:

* Avoids both resource leaks, and spurious double-freeing of resources
(e.g., a second unlink() for a file from a different process) when
there are errors. The latter problem was possible before, a known
issue with V8 of the patch. I believe that this revision avoids these
problems in a way that is *absolutely bulletproof* in the face of
arbitrary failures (e.g., palloc() failure) in any process at any
time. Although, be warned that there is a remaining open item
concerning resource management in the leader-as-worker case, which I
go into below.

There are now what you might call "critical sections" in one function.
That is, there are points where we cannot throw an error (without a
START_CRIT_SECTION()!), but those are entirely confined to unification
code within the leader, where we can be completely sure that no error
can be raised. The leader can even fail before some but not all of a
particular worker's segments are in its local resource manager, and we
still do the right thing. I've been testing this by adding code that
randomly throws errors at points interspersed throughout worker and
leader unification hand-off points. I then leave this stress-test
build to run for a few hours, while monitoring for leaked files and
spurious fd.c reports of double-unlink() and similar issues. Test
builds change LOG to PANIC within several places in fd.c, while
MAX_PHYSICAL_FILESIZE was reduced from 1GiB to BLCKSZ.
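
The fault injection amounts to something like this, sprinkled at the
hand-off points (a hypothetical testing aid, not part of the patch
proper):

/*
 * Randomly throw an error, to verify that no path leaks a file or
 * unlinks a file twice.  Assumes ordinary backend (elog) context.
 */
#define RANDOM_FAULT_POINT() \
    do { \
        if (random() % 100 == 0) \
            elog(ERROR, "fault injection: simulated failure"); \
    } while (0)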

All of these guarantees are made without any special care from caller
to buffile.c. The only V9 change to tuplesort.c or logtape.c in this
general area is that they have to pass a dynamic shared memory segment
to buffile.c, so that it can register a new callback. That's it. This
may be of particular interest to Thomas. All complexity is confined to
buffile.c.

* No expansion in the use of shared memory to manage resources.
BufFile refcount is still per-worker. The role of local resource
managers is unchanged.

* Additional complexity over and above ordinary BufFile resource
management is confined to the leader process and its on_dsm_detach()
callback. Only the leader registers a callback. Of course, refcount
management within BufFileClose() can still take place in workers, but
that isn't something that we rely on (that's only for non-error
paths). In general, worker processes mostly have resource managers
managing their temp file segments as a thing that has nothing to do
with BufFiles (BufFiles are still not owned by resowner.c/fd.c --
they're blissfully unaware of all of this stuff).

* In general, unified BufFiles can still be treated in exactly the
same way as conventional BufFiles, and things just work, without any
special cases being exercised internally.

There is still an open item here, though: The leader-as-worker
Tuplesortstate, a special case, can still leak files. So,
stress-testing will only show the patch to be completely robust
against resource leaks when nbtsort.c is modified to enable
FORCE_SINGLE_WORKER testing. Despite the name FORCE_SINGLE_WORKER, you
can also modify that file to force there to be arbitrary-many workers
requested (just change "requested = 1" to something else). The
leader-as-worker problem is avoided because we don't have the leader
participating as a worker this way, which would otherwise present
issues for resowner.c that I haven't got around to fixing just yet. It
isn't hard to imagine why this is -- one backend with two FDs for
certain fd.c temp segments is just going to cause problems for
resowner.c without additional special care. Didn't seem worth blocking
on that. I want to prove that my general approach is workable. That
problem is confined to one backend's resource manager when it is the
leader participating as a worker. It is not a refcount problem. The
simplest solution here would be to ban the leader-as-worker case by
contract. Alternatively, we could pass fd.c segments from the
leader-as-worker Tuplesortstate's BufFile to the leader
Tuplesortstate's BufFile without opening or closing anything. This
way, there will be no second vFD entry for any segment at any time.
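
That second approach might look something like this sketch (internal
to buffile.c; BufFileTransferSegments() is hypothetical, and the real
BufFile struct also has per-segment state beyond files/numFiles that
would need the same treatment):

/*
 * Hand the leader-as-worker BufFile's fd.c segments to the leader's
 * BufFile directly, so no segment ever has a second vFD entry within
 * the same backend.
 */
static void
BufFileTransferSegments(BufFile *dst, BufFile *src)
{
    int         i;

    dst->files = (File *) repalloc(dst->files,
                                   sizeof(File) * (dst->numFiles + src->numFiles));
    for (i = 0; i < src->numFiles; i++)
        dst->files[dst->numFiles + i] = src->files[i];
    dst->numFiles += src->numFiles;
    src->numFiles = 0;          /* src no longer owns any segments */
}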

I've also made several changes to the cost model, changes agreed to
over on the "Cost model for parallel CREATE INDEX" thread. No need for
a recap on what those changes are here. In short, things have been
*significantly* simplified in that area.

Finally, note that I decided to throw out more code within
tuplesort.c. Now, a parallel leader is a thing that is explicitly set
up to be exactly consistent with a conventional/serial external sort
whose merge is about to begin. In particular, it now uses mergeruns().

Robert said that he thinks that this is a patch that is to some degree
a parallelism patch, and to some degree about sorting. I'd say that by
now, it's roughly 5% about sorting, in terms of the proportion of code
that expressly considers sorting. Most of the new stuff in tuplesort.c
is about managing dependencies between participating backends. I've
really focused on avoiding new special cases, especially with V9.

-- 
Peter Geoghegan



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Sun, Mar 12, 2017 at 3:05 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> There is still an open item here, though: The leader-as-worker
> Tuplesortstate, a special case, can still leak files.

I phrased this badly. What I mean is that there can be instances where
temp files are left on disk following a failure such as palloc() OOM;
no backend ends up doing an unlink() iff a leader-as-worker
Tuplesortstate was used and we get unlucky. I did not mean a leak of
virtual or real file descriptors, which would see Postgres print a
refcount leak warning from resowner.c. Naturally, these "leaked" files
will eventually be deleted by the next restart of the server at the
latest, within RemovePgTempFiles(). Note also that a duplicate
unlink() (with annoying LOG message) is impossible under any
circumstances with V9, regardless of whether or not a leader-as-worker
Tuplesort state is involved.

Anyway, I was sure that I needed to completely nail this down in order
to be consistent with existing guarantees, but another look at
OpenTemporaryFile() makes me doubt that. ResourceOwnerEnlargeFiles()
is called, which itself uses palloc(), which can of course fail. There
are remarks over that function within resowner.c about OOM:

/*
 * Make sure there is room for at least one more entry in a ResourceOwner's
 * files reference array.
 *
 * This is separate from actually inserting an entry because if we run out
 * of memory, it's critical to do so *before* acquiring the resource.
 */
void
ResourceOwnerEnlargeFiles(ResourceOwner owner)
{   ...
}

But this happens after OpenTemporaryFileInTablespace() has already
returned. Taking care to allocate memory up-front here is motivated by
keeping the vFD cache entry and current resource owner in perfect
agreement about the FD_XACT_TEMPORARY-ness of a file, and that's it.
It's *not* true that there is a broader sense in which
OpenTemporaryFile() is atomic, which for some reason I previously
believed to be the case.

So, I haven't failed to prevent an outcome that wasn't already
possible. It doesn't seem like it would be that hard to fix this, and
then have the parallel tuplesort patch live up to that new higher
standard. But, it's possible that Tom or maybe someone else would
consider that a bad idea, for roughly the same reason that we don't
call RemovePgTempFiles() for *crash* induced restarts, as mentioned by
Thomas up-thread:
/*
 * NOTE: we could, but don't, call this during a post-backend-crash restart
 * cycle.  The argument for not doing it is that someone might want to examine
 * the temp files for debugging purposes.  This does however mean that
 * OpenTemporaryFile had better allow for collision with an existing temp
 * file name.
 */
void
RemovePgTempFiles(void)
{   ...
}

Note that I did put some thought into making sure OpenTemporaryFile()
does the right thing with collisions with existing temp files. So,
maybe the right thing is to do nothing at all. I don't have strong
feelings either way on this question.

-- 
Peter Geoghegan



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Sun, Mar 12, 2017 at 3:05 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> I attach my V9 of the patch. I came up with some stuff for the design of
> resource management that I think meets every design goal that we have
> for shared/unified BufFiles:

Commit 2609e91fc broke the parallel CREATE INDEX cost model. I should
now pass -1 as the index block argument to compute_parallel_worker(),
just as all callers that aren't parallel index scan do after that
commit. This issue caused V9 to never choose parallel CREATE INDEX
within nbtsort.c. There was also a small amount of bitrot.

Attached V10 fixes this regression. I also couldn't resist adding a
few new assertions that I thought were worth having to buffile.c, plus
dedicated wait events for parallel tuplesort. And, I fixed a silly bug
added in V9 around where worker_wait() should occur.

-- 
Peter Geoghegan



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Robert Haas
Date:
On Sun, Mar 19, 2017 at 9:03 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Sun, Mar 12, 2017 at 3:05 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>> I attach my V9 of the patch. I came up with some stuff for the design of
>> resource management that I think meets every design goal that we have
>> for shared/unified BufFiles:
>
> Commit 2609e91fc broke the parallel CREATE INDEX cost model. I should
> now pass -1 as the index block argument to compute_parallel_worker(),
> just as all callers that aren't parallel index scan do after that
> commit. This issue caused V9 to never choose parallel CREATE INDEX
> within nbtsort.c. There was also a small amount of bitrot.
>
> Attached V10 fixes this regression. I also couldn't resist adding a
> few new assertions that I thought were worth having to buffile.c, plus
> dedicated wait events for parallel tuplesort. And, I fixed a silly bug
> added in V9 around where worker_wait() should occur.

Some initial review comments:

- * This code is moderately slow (~10% slower) compared to the regular
- * btree (insertion) build code on sorted or well-clustered data.  On
- * random data, however, the insertion build code is unusable -- the
- * difference on a 60MB heap is a factor of 15 because the random
- * probes into the btree thrash the buffer pool.  (NOTE: the above
- * "10%" estimate is probably obsolete, since it refers to an old and
- * not very good external sort implementation that used to exist in
- * this module.  tuplesort.c is almost certainly faster.)

While I agree that the old comment is probably inaccurate, I don't
think dropping it without comment in a patch to implement parallel
sorting is the way to go. How about updating it to be more current as
a separate patch?

+/* Magic numbers for parallel state sharing */
+#define PARALLEL_KEY_BTREE_SHARED              UINT64CONST(0xA000000000000001)
+#define PARALLEL_KEY_TUPLESORT                 UINT64CONST(0xA000000000000002)
+#define PARALLEL_KEY_TUPLESORT_SPOOL2  UINT64CONST(0xA000000000000003)

1, 2, and 3 would probably work just as well.  The parallel
infrastructure uses high-numbered values to avoid conflict with
plan_node_id values, but this is a utility statement so there's no
such problem.  But it doesn't matter very much.

+ * Note: caller had better already hold some type of lock on the table and
+ * index.
+ */
+int
+plan_create_index_workers(Oid tableOid, Oid indexOid)

Caller should pass down the Relation rather than the Oid.  That is
better both because it avoids unnecessary work and because it more or
less automatically avoids the problem mentioned in the note.

Why is this being put in planner.c rather than something specific to
creating indexes?  Not sure that's a good idea.

+ * This should be called when workers have flushed out temp file buffers and
+ * yielded control to caller's process.  Workers should hold open their
+ * BufFiles at least until the caller's process is able to call here and
+ * assume ownership of BufFile.  The general pattern is that workers make
+ * available data from their temp files to one nominated process; there is
+ * no support for workers that want to read back data from their original
+ * BufFiles following writes performed by the caller, or any other
+ * synchronization beyond what is implied by caller contract.  All
+ * communication occurs in one direction.  All output is made available to
+ * caller's process exactly once by workers, following call made here at the
+ * tail end of processing.

Thomas has designed a system for sharing files among cooperating
processes that lacks several of these restrictions.  With his system,
it's still necessary for all data to be written and flushed by the
writer before anybody tries to read it.  But the restriction that the
worker has to hold its BufFile open until the leader can assume
ownership goes away.  That's a good thing; it avoids the need for
workers to sit around waiting for the leader to assume ownership of a
resource instead of going away faster and freeing up worker slots for
some other query, or moving on to some other computation.   The
restriction that the worker can't reread the data after handing off
the file also goes away.  The files can be read and written by any
participant in any order, as many times as you like, with only the
restriction that the caller must guarantee that data will be written
and flushed from private buffers before it can be read.  I don't see
any reason to commit both his system and your system, and his is more
general so I think you should use it.  That would cut hundreds of
lines from this patch with no real disadvantage that I can see --
including things like worker_wait(), which are only needed because of
the shortcomings of the underlying mechanism.

+ * run.  Parallel workers always use quicksort, however.

Comment fails to mention a reason.

+        elog(LOG, "%d using " INT64_FORMAT " KB of memory for read
buffers among %d input tapes",
+             state->worker, state->availMem / 1024, numInputTapes);

I think "worker %d" or "participant %d" would be a lot better than
just starting the message with "%d".  (There are multiple instances of
this, with various messages.)
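
That is, something more like:

elog(LOG, "worker %d using " INT64_FORMAT " KB of memory for read buffers among %d input tapes",
     state->worker, state->availMem / 1024, numInputTapes);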

I think some of the smaller changes that this patch makes, like
extending the parallel context machinery to support SnapshotAny, could
be usefully broken out as separately-committable patches.

I haven't really dug down into the details here, but with the
exception of the buffile.c stuff which I don't like, the overall
design of this seems pretty sensible to me.  We might eventually want
to do something more clever at the sorting level, but those changes
would be confined to tuplesort.c, and all the other changes you've
introduced here would stand on their own.  Which is to say that even
if there's more win to be had here, this is a good start.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Tue, Mar 21, 2017 at 9:10 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> - * This code is moderately slow (~10% slower) compared to the regular
> - * btree (insertion) build code on sorted or well-clustered data.  On
> - * random data, however, the insertion build code is unusable -- the
> - * difference on a 60MB heap is a factor of 15 because the random
> - * probes into the btree thrash the buffer pool.  (NOTE: the above
> - * "10%" estimate is probably obsolete, since it refers to an old and
> - * not very good external sort implementation that used to exist in
> - * this module.  tuplesort.c is almost certainly faster.)
>
> While I agree that the old comment is probably inaccurate, I don't
> think dropping it without comment in a patch to implement parallel
> sorting is the way to go. How about updating it to be more current as
> a separate patch?

I think that since the comment refers to code from before 1999, it can
go. Any separate patch to remove it would have an entirely negative
linediff.

> +/* Magic numbers for parallel state sharing */

> 1, 2, and 3 would probably work just as well.

Okay.

> Why is this being put in planner.c rather than something specific to
> creating indexes?  Not sure that's a good idea.

The idea is that it's the planner's domain, but this is a utility
statement, so it makes sense to put it next to the CLUSTER function
that determines whether CLUSTER sorts rather than doing an index scan.
I don't have strong feelings on how appropriate that is.

> + * This should be called when workers have flushed out temp file buffers and
> + * yielded control to caller's process.  Workers should hold open their
> + * BufFiles at least until the caller's process is able to call here and
> + * assume ownership of BufFile.  The general pattern is that workers make
> + * available data from their temp files to one nominated process; there is
> + * no support for workers that want to read back data from their original
> + * BufFiles following writes performed by the caller, or any other
> + * synchronization beyond what is implied by caller contract.  All
> + * communication occurs in one direction.  All output is made available to
> + * caller's process exactly once by workers, following call made here at the
> + * tail end of processing.
>
> Thomas has designed a system for sharing files among cooperating
> processes that lacks several of these restrictions.  With his system,
> it's still necessary for all data to be written and flushed by the
> writer before anybody tries to read it.  But the restriction that the
> worker has to hold its BufFile open until the leader can assume
> ownership goes away.  That's a good thing; it avoids the need for
> workers to sit around waiting for the leader to assume ownership of a
> resource instead of going away faster and freeing up worker slots for
> some other query, or moving on to some other computation.   The
> restriction that the worker can't reread the data after handing off
> the file also goes away.

There is no restriction about workers not being able to reread data.
That comment makes it clear that that's only when the leader writes to
the file. It alludes to rereading within a worker following the leader
writing to their files in order to recycle blocks within logtape.c,
which the patch never has to do, unless you enable one of the 0002-*
testing GUCs to force randomAccess.

Obviously, if you write to the file in the leader, there is little
that the worker can do afterwards, but it's not a given that you'd
want to do that, and this patch actually never does. You could equally
well say that PHJ fails to provide for my requirement for having the
leader write to the files sensibly in order to recycle blocks, a
requirement that its shared BufFile mechanism expressly does not
support.

> That would cut hundreds of
> lines from this patch with no real disadvantage that I can see --
> including things like worker_wait(), which are only needed because of
> the shortcomings of the underlying mechanism.

I think it would definitely be a significant net gain in LOC. And,
worker_wait() will probably be replaced by the use of the barrier
abstraction anyway. It didn't seem worth creating a dependency on that
early, given my simple requirements. PHJ uses barriers instead,
presumably because there is much more of this stuff. The workers
generally won't have to wait at all. It's expected to be pretty much
instantaneous.

> + * run.  Parallel workers always use quicksort, however.
>
> Comment fails to mention a reason.

Well, I don't think that there is any reason to use replacement
selection at all, what with the additional merge heap work last year.
But, the theory there remains that RS is good when you can get one big
run and no merge. You're not going to get that with parallel sort in
any case, since the leader must merge. Besides, merging in the workers
happens in the workers. And, the backspace requirement of 32MB of
workMem per participant pretty much eliminates any use of RS that
you'd get otherwise.

> I think "worker %d" or "participant %d" would be a lot better than
> just starting the message with "%d".  (There are multiple instances of
> this, with various messages.)

Okay.

> I think some of the smaller changes that this patch makes, like
> extending the parallel context machinery to support SnapshotAny, could
> be usefully broken out as separately-committable patches.

Okay.

> I haven't really dug down into the details here, but with the
> exception of the buffile.c stuff which I don't like, the overall
> design of this seems pretty sensible to me.  We might eventually want
> to do something more clever at the sorting level, but those changes
> would be confined to tuplesort.c, and all the other changes you've
> introduced here would stand on their own.  Which is to say that even
> if there's more win to be had here, this is a good start.

That's certainly how I feel about it.

I believe that the main reason that you like the design I came up with
on the whole is that it's minimally divergent from the serial case.
The changes in logtape.c and tuplesort.c are actually very minor. But,
the reason that that's possible at all is because buffile.c adds some
complexity that is all about maintaining existing assumptions. You
don't like that complexity. I would suggest that it's useful that I've
been able to isolate it to buffile.c fairly well.

A quick tally of the existing assumptions this patch preserves:

1. Resource managers still work as before. This means that error
handling will work the same way as before. We cooperate with that
mechanism, rather than supplanting it entirely.

2. There is only one BufFile per logical tapeset per tuplesort, in
both workers and the leader.

3. You can write to the end of a unified BufFile in leader to have it
extended, while resource managers continue to do the right thing
despite differing requirements for each segment. This leaves things
sane for workers to read, provided the leader keeps to its own space
in the unified BufFile.

4. Temp files must go away at EoX, no matter what.

Thomas has created a kind of shadow resource manager in shared memory.
So, he isn't using fd.c resource management stuff. He is concerned
with a set of BufFiles, each of which has specific significance to
each parallel hash join (they're per worker HJ batch). PHJ has an
unpredictable number of BufFiles, while parallel tuplesort always has
one, just as before. For the most part, I think that what Thomas has
done reflects his own requirements, just as what I've done reflects my
requirements. There seems to be no excellent opportunity to use a
common infrastructure.

I think that not cooperating with the existing mechanism will prove to
be buggy. Following a quick look at the latest PHJ patch series, and
its 0008-hj-shared-buf-file-v8.patch file, I already see one example.
I notice that there could be multiple calls to
pgstat_report_tempfile() within each backend for the same BufFile
segment. Isn't that counting the same thing more than once? In
general, it seems problematic that there is now "true" fd.c temp
segments, as well as Shared BufFile temp segments that are never in a
backend resource manager.

-- 
Peter Geoghegan



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Robert Haas
Date:
On Tue, Mar 21, 2017 at 2:03 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> I think that since the comment refers to code from before 1999, it can
> go. Any separate patch to remove it would have an entirely negative
> linediff.

It's a good general principle that a patch should do one thing well
and not make unrelated changes.  I try hard to adhere to that
principle in my commits, and I think other committers generally do
(and should), too.  Of course, different people draw the line in
different places.  If you can convince another committer to include
that change in their commit of this patch, well, that's not my cup of
tea, but so be it.  If you want me to consider committing this, you're
going to have to submit that part separately, preferably on a separate
thread with a suitably descriptive subject line.

> Obviously, if you write to the file in the leader, there is little
> that the worker can do afterwards, but it's not a given that you'd
> want to do that, and this patch actually never does. You could equally
> well say that PHJ fails to provide for my requirement for having the
> leader write to the files sensibly in order to recycle blocks, a
> requirement that its shared BufFile mechanism expressly does not
> support.

From my point of view, the main point is that having two completely
separate mechanisms for managing temporary files that need to be
shared across cooperating workers is not a good decision.  That's a
need that's going to come up over and over again, and it's not
reasonable for everybody who needs it to add a separate mechanism for
doing it.  We need to have ONE mechanism for it.

The second point is that I'm pretty convinced that the design you've
chosen is fundamentally wrong.  I've attempted to explain that multiple
times, starting about three months ago with
http://postgr.es/m/CA+TgmoYP0vzPw64DfMQT1JHY6SzyAvjogLkj3erMZzzN2f9xLA@mail.gmail.com
and continuing across many subsequent emails on multiple threads.
It's just not OK in my book for a worker to create something that it
initially owns and then later transfer it to the leader.  The
cooperating backends should have joint ownership of the objects from
the beginning, and the last process to exit the set should clean up
those resources.

>> That would cut hundreds of
>> lines from this patch with no real disadvantage that I can see --
>> including things like worker_wait(), which are only needed because of
>> the shortcomings of the underlying mechanism.
>
> I think it would definitely be a significant net gain in LOC. And,
> worker_wait() will probably be replaced by the use of the barrier
> abstraction anyway.

No, because if you do it Thomas's way, the worker can exit right away,
without waiting.  You don't have to wait via a different method; you
escape waiting altogether.  I understand that your point is that the
wait will always be brief, but I think that's probably an optimistic
assumption and definitely an unnecessary assumption.  It's optimistic
because there is absolutely no guarantee that all workers will take
the same amount of time to sort the data they read.  It is absolutely
not the case that all data sets sort at the same speed.  Because of
the way parallel sequential scan works, we're somewhat insulated from
that; workers that sort faster will get a larger chunk of the table.
However, that only means that workers will finish generating their
sorted runs at about the same time, not that they will finish merging
at the same time.  And, indeed, if some workers end up with more data
than others (so that they finish building runs at about the same time)
then some will probably take longer to complete the merging than
others.

But even if it were true that the waits will always be brief, I still
think the way you've done it is a bad idea, because now tuplesort.c
has to know that it needs to wait because of some detail of
lower-level resource management about which it should not have to
care.  That alone is a sufficient reason to want a better approach.

I completely accept that whatever abstraction we use at the BufFile
level has to be something that can be plumbed into logtape.c, and if
Thomas's mechanism can't be bolted in there in a sensible way then
that's a problem.  But I feel quite strongly that the solution to that
problem isn't to adopt the approach you've taken here.

>> + * run.  Parallel workers always use quicksort, however.
>>
>> Comment fails to mention a reason.
>
> Well, I don't think that there is any reason to use replacement
> selection at all, what with the additional merge heap work last year.
> But, the theory there remains that RS is good when you can get one big
> run and no merge. You're not going to get that with parallel sort in
> any case, since the leader must merge. Besides, any merging of a
> worker's own runs still happens within that worker. And, the
> backspace requirement of 32MB of
> workMem per participant pretty much eliminates any use of RS that
> you'd get otherwise.

So, please mention that briefly in the comment.

> I believe that the main reason that you like the design I came up with
> on the whole is that it's minimally divergent from the serial case.

That's part of it, I guess, but it's more that the code you've added
to do parallelism here looks an awful lot like what's gotten added to
do parallelism in other cases, like parallel query.  That's probably a
good sign.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Tue, Mar 21, 2017 at 12:06 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> From my point of view, the main point is that having two completely
> separate mechanisms for managing temporary files that need to be
> shared across cooperating workers is not a good decision.  That's a
> need that's going to come up over and over again, and it's not
> reasonable for everybody who needs it to add a separate mechanism for
> doing it.  We need to have ONE mechanism for it.

Obviously I understand that there is value in code reuse in general.
The exact extent to which code reuse is possible here has been unclear
throughout, because it's complicated for all kinds of reasons. That's
why Thomas and I had 2 multi-hour Skype calls all about it.

> It's just not OK in my book for a worker to create something that it
> initially owns and then later transfer it to the leader.

Isn't that an essential part of having a refcount, in general? You
were the one that suggested refcounting.

> The cooperating backends should have joint ownership of the objects from
> the beginning, and the last process to exit the set should clean up
> those resources.

That seems like a facile summary of the situation. There is a sense in
which there is always joint ownership of files with my design. But
there is also a sense in which there isn't, because it's impossible to
do that while not completely reinventing resource management of temp
files. I wanted to preserve resowner.c ownership of fd.c segments.

You maintain that it's better to have the leader unlink() everything
at the end, and suppress the errors when that doesn't work, so that
that path always just plows through. I disagree with that. It is a
trade-off, I suppose. I have now run out of time to work through it
with you or Thomas, though.

> But even if it were true that the waits will always be brief, I still
> think the way you've done it is a bad idea, because now tuplesort.c
> has to know that it needs to wait because of some detail of
> lower-level resource management about which it should not have to
> care.  That alone is a sufficient reason to want a better approach.

There is already a point at which the leader needs to wait, so that it
can accumulate stats that nbtsort.c cares about. So we already need a
leader wait point within nbtsort.c (that one is called directly by
nbtsort.c). Doesn't seem like too bad of a wart to have the same thing
for workers.

>> I believe that the main reason that you like the design I came up with
>> on the whole is that it's minimally divergent from the serial case.
>
> That's part of it, I guess, but it's more that the code you've added
> to do parallelism here looks an awful lot like what's gotten added to
> do parallelism in other cases, like parallel query.  That's probably a
> good sign.

It's also a good sign that it makes CREATE INDEX approximately 3 times faster.

-- 
Peter Geoghegan



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Robert Haas
Date:
On Tue, Mar 21, 2017 at 3:50 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Tue, Mar 21, 2017 at 12:06 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> From my point of view, the main point is that having two completely
>> separate mechanisms for managing temporary files that need to be
>> shared across cooperating workers is not a good decision.  That's a
>> need that's going to come up over and over again, and it's not
>> reasonable for everybody who needs it to add a separate mechanism for
>> doing it.  We need to have ONE mechanism for it.
>
> Obviously I understand that there is value in code reuse in general.
> The exact extent to which code reuse is possible here has been unclear
> throughout, because it's complicated for all kinds of reasons. That's
> why Thomas and I had 2 multi-hour Skype calls all about it.

I agree that the extent to which code reuse is possible here is
somewhat unclear, but I am 100% confident that the answer is non-zero.
You and Thomas both need BufFiles that can be shared across multiple
backends associated with the same ParallelContext.  I don't understand
how you can argue that it's reasonable to have two different ways of
sharing the same kind of object across the same set of processes.  And
if that's not reasonable, then somehow we need to come up with a
single mechanism that can meet both your requirements and Thomas's
requirements.

>> It's just not OK in my book for a worker to create something that it
>> initially owns and then later transfer it to the leader.
>
> Isn't that an essential part of having a refcount, in general? You
> were the one that suggested refcounting.

No, quite the opposite.  My point in suggesting adding a refcount was
to avoid needing to have a single owner.  Instead, the process that
decrements the reference count to zero becomes responsible for doing
the cleanup.  What you've done with the ref count is use it as some
kind of medium for transferring responsibility from backend A to
backend B; what I want is to allow backends A, B, C, D, E, and F to
attach to the same shared resource, and whichever one of them happens
to be the last one out of the room shuts off the lights.

>> The cooperating backends should have joint ownership of the objects from
>> the beginning, and the last process to exit the set should clean up
>> those resources.
>
> That seems like a facile summary of the situation. There is a sense in
> which there is always joint ownership of files with my design. But
> there is also a sense in which there isn't, because it's impossible to
> do that while not completely reinventing resource management of temp
> files. I wanted to preserve resowner.c ownership of fd.c segments.

As I've said before, I think that's an anti-goal.  This is a different
problem, and trying to reuse the solution we chose for the
non-parallel case doesn't really work.  resowner.c could end up owning
a shared reference count which it's responsible for decrementing --
and then decrementing it removes the file if the result is zero.  But
it can't own performing the actual unlink(), because then we can't
support cases where the file may have multiple readers, since whoever
owns the unlink() might try to zap the file out from under one of the
others.

> You maintain that it's better to have the leader unlink() everything
> at the end, and suppress the errors when that doesn't work, so that
> that path always just plows through.

I don't want the leader to be responsible for anything.  I want the
last process to detach to be responsible for cleanup, regardless of
which process that ends up being.  I want that for lots of good
reasons which I have articulated including (1) it's how all other
resource management for parallel query already works, e.g. DSM, DSA,
and group locking; (2) it avoids the need for one process to sit and
wait until another process assumes ownership, which isn't a feature
even if (as you contend, and I'm not convinced) it doesn't hurt much;
and (3) it allows for use cases where multiple processes are reading
from the same shared BufFile without the risk that some other process
will try to unlink() the file while it's still in use.  The point for
me isn't so much whether unlink() ever ignores errors as whether
cleanup (however defined) is an operation guaranteed to happen exactly
once.
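
To sketch what I mean, building on the hypothetical refcount code
upthread (again, every name here is made up): cleanup can simply be
hung off DSM detach, which already runs exactly once per backend, on
both success and error paths:

static void
shared_fileset_on_detach(dsm_segment *seg, Datum arg)
{
    shared_fileset_detach((SharedFileSetSlot *) DatumGetPointer(arg));
}

static void
shared_fileset_attach(SharedFileSetSlot *slot, dsm_segment *seg)
{
    SpinLockAcquire(&slot->mutex);
    slot->refcnt++;
    SpinLockRelease(&slot->mutex);

    /*
     * Piggy-back on DSM segment lifetime: the callback runs in every
     * attached backend, so the last one to detach does the cleanup, no
     * matter which backend that turns out to be.
     */
    on_dsm_detach(seg, shared_fileset_on_detach, PointerGetDatum(slot));
}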

> I disagree with that. It is a
> trade-off, I suppose. I have now run out of time to work through it
> with you or Thomas, though.

Bummer.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Thomas Munro
Date:
On Wed, Mar 22, 2017 at 10:03 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Mar 21, 2017 at 3:50 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>> I disagree with that. It is a
>> trade-off, I suppose. I have now run out of time to work through it
>> with you or Thomas, though.
>
> Bummer.

I'm going to experiment with refactoring the v10 parallel CREATE INDEX
patch to use the SharedBufFileSet interface from
hj-shared-buf-file-v8.patch today and see what problems I run into.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Tue, Mar 21, 2017 at 2:03 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I agree that the extent to which code reuse is possible here is
> somewhat unclear, but I am 100% confident that the answer is non-zero.
> You and Thomas both need BufFiles that can be shared across multiple
> backends associated with the same ParallelContext.  I don't understand
> how you can argue that it's reasonable to have two different ways of
> sharing the same kind of object across the same set of processes.

I didn't argue that. Rather, I argued that there are going to be
significant additional requirements for PHJ, because it has to support
arbitrarily many BufFiles, rather than either 1 or 2 (one per
tuplesort/logtapeset). Just how "significant" that would be I cannot
say, regrettably. (Or, we're going to have to make logtape.c multiplex
BufFiles, which risks breaking other logtape.c routines that aren't
even used just yet.)

>> Isn't that an essential part of having a refcount, in general? You
>> were the one that suggested refcounting.
>
> No, quite the opposite.  My point in suggesting adding a refcount was
> to avoid needing to have a single owner.  Instead, the process that
> decrements the reference count to zero becomes responsible for doing
> the cleanup.  What you've done with the ref count is use it as some
> kind of medium for transferring responsibility from backend A to
> backend B; what I want is to allow backends A, B, C, D, E, and F to
> attach to the same shared resource, and whichever one of them happens
> to be the last one out of the room shuts off the lights.

Actually, that's quite possible with the design I came up with. The
restriction that Thomas can't live with as I've left things is that
you have to know the number of BufFiles ahead of time. I'm pretty sure
that that's all it is. (I do sympathize with the fact that that isn't
very helpful to him, though.)

> As I've said before, I think that's an anti-goal.  This is a different
> problem, and trying to reuse the solution we chose for the
> non-parallel case doesn't really work.  resowner.c could end up owning
> a shared reference count which it's responsible for decrementing --
> and then decrementing it removes the file if the result is zero.  But
> it can't own performing the actual unlink(), because then we can't
> support cases where the file may have multiple readers, since whoever
> owns the unlink() might try to zap the file out from under one of the
> others.

Define "zap the file". I think, based on your remarks here, that
you've misunderstood my design. I think you should at least understand
it fully if you're going to dismiss it.

It is true that a worker resowner can unlink() the files
mid-unification, in the same manner as with conventional temp files,
and not decrement its refcount in shared memory, or care at all in any
special way. This is okay because the leader (in the case of parallel
tuplesort) will realize that it should not "turn out the lights",
finding that remaining reference when it calls BufFileClose() in
its registered callback, as it alone must. It doesn't matter that the
unlink() may have already occurred, or may be just about to occur,
because we are only operating on already-opened files, and never on
the link itself (we don't have to stat() the file link for example,
which is naturally only a task for the unlink()'ing backend anyway).
You might say that the worker only blows away the link itself, not the
file proper, since it may still be open in leader (say).

** We rely on the fact that files are themselves a kind of reference
counted thing, in general; they have an independent existence from the
link originally used to open() them. **
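
To illustrate, here is a tiny standalone program (nothing
Postgres-specific, and not from the patch) demonstrating the behavior
we rely on:

#include <fcntl.h>
#include <unistd.h>

int
main(void)
{
    char        buf[5];
    int         fd = open("scratch_file", O_RDWR | O_CREAT, 0600);

    write(fd, "hello", 5);
    unlink("scratch_file");     /* the link is gone immediately... */
    lseek(fd, 0, SEEK_SET);
    read(fd, buf, 5);           /* ...but the data lives on: "hello" */
    close(fd);                  /* only now is the space reclaimed */
    return 0;
}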

The reason that there is a brief wait in workers for parallel
tuplesort is that it gives us the opportunity to have the immediately
subsequent worker BufFileClose() not turn out the lights in the
worker, because the leader must have a reference on the BufFile by the
time workers are released. So, there is a kind of interlock that makes
sure that there is always at least 1 owner.

** There would be no need for an additional wait but for the fact that
the leader wants to unify multiple worker BufFiles as one, and must
open them all at once for the sake of simplicity. But that's just how
parallel tuplesort in particular happens to work, since it has only
one BufFile in the leader, which it wants to operate on with
everything set up up-front. **
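
If it helps, the interlock can be pictured roughly like this. It's
only a sketch -- the names are made up, the patch's actual
worker_wait() differs in detail, and synchronization around the flag
is elided for brevity:

static void
worker_wait_for_leader(Sharedsort *shared)
{
    /* a worker may not BufFileClose() until the leader holds a reference */
    ConditionVariablePrepareToSleep(&shared->workersDone);
    while (!shared->leaderHasReference)
        ConditionVariableSleep(&shared->workersDone,
                               WAIT_EVENT_PARALLEL_FINISH);
    ConditionVariableCancelSleep();
}

static void
leader_release_workers(Sharedsort *shared)
{
    /* leader side, once it has opened every worker's BufFile */
    shared->leaderHasReference = true;
    ConditionVariableBroadcast(&shared->workersDone);
}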

Thomas' design cannot reliably know how many segments there are in
workers in error paths, which necessitates his unlink()-ENOENT-ignore
hack. My solution is that workers/owners look after their own temp
segments in the conventional way, until they reach BufFileClose(),
which may never come if there is an error. The only way that clean-up
won't happen in conventional resowner.c-in-worker fashion is if
BufFileClose() is reached in owner/worker. BufFileClose() must be
reached when there is no error, which has to happen anyway when using
temp files. (Else there is a temp file leak warning from resowner.c.)

This is the only way to avoid the unlink()-ENOENT-ignore hack, AFAICT,
since only the worker itself can reliably know how many segments it
has opened at every single instant in time. Because it's the owner!

>> You maintain that it's better to have the leader unlink() everything
>> at the end, and suppress the errors when that doesn't work, so that
>> that path always just plows through.
>
> I don't want the leader to be responsible for anything.

I meant in the case of parallel CREATE INDEX specifically, were it to
use this other mechanism. Substitute "leader" with "the last backend"
in reading my remarks here.

> I want the
> last process to detach to be responsible for cleanup, regardless of
> which process that ends up being.  I want that for lots of good
> reasons which I have articulated including (1) it's how all other
> resource management for parallel query already works, e.g. DSM, DSA,
> and group locking; (2) it avoids the need for one process to sit and
> wait until another process assumes ownership, which isn't a feature
> even if (as you contend, and I'm not convinced) it doesn't hurt much;
> and (3) it allows for use cases where multiple processes are reading
> from the same shared BufFile without the risk that some other process
> will try to unlink() the file while it's still in use.  The point for
> me isn't so much whether unlink() ever ignores errors as whether
> cleanup (however defined) is an operation guaranteed to happen exactly
> once.

My patch demonstrably has these properties. I've done quite a bit of
fault injection testing to prove it.

(Granted, I need to take extra steps for the leader-as-worker backend,
a special case, which I haven't done already because I was waiting on
your feedback on the appropriate trade-off there.)

-- 
Peter Geoghegan



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Tue, Mar 21, 2017 at 2:49 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> I'm going to experiment with refactoring the v10 parallel CREATE INDEX
> patch to use the SharedBufFileSet interface from
> hj-shared-buf-file-v8.patch today and see what problems I run into.

I would be happy if you took over parallel CREATE INDEX completely. It
makes a certain amount of sense, and not just because I am no longer
able to work on it.

You're the one doing things with shared BufFiles that are of
significant complexity. Certainly more complicated than what parallel
CREATE INDEX needs in every way, and necessarily so. I will still have
some more feedback on your shared BufFile design, though, while it's
fresh in my mind.

-- 
Peter Geoghegan



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Robert Haas
Date:
On Tue, Mar 21, 2017 at 7:37 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>>> Isn't that an essential part of having a refcount, in general? You
>>> were the one that suggested refcounting.
>>
>> No, quite the opposite.  My point in suggesting adding a refcount was
>> to avoid needing to have a single owner.  Instead, the process that
>> decrements the reference count to zero becomes responsible for doing
>> the cleanup.  What you've done with the ref count is use it as some
>> kind of medium for transferring responsibility from backend A to
>> backend B; what I want is to allow backends A, B, C, D, E, and F to
>> attach to the same shared resource, and whichever one of them happens
>> to be the last one out of the room shuts off the lights.
>
> Actually, that's quite possible with the design I came up with.

I don't think it is.  What sequence of calls to the APIs you've
proposed would accomplish that goal?  I don't see anything in this
patch set that would permit anything other than a handoff from the
worker to the leader.  There seems to be no way for the ref count to
be more than 1 (or 2?).

> The
> restriction that Thomas can't live with as I've left things is that
> you have to know the number of BufFiles ahead of time. I'm pretty sure
> that that's all it is. (I do sympathize with the fact that that isn't
> very helpful to him, though.)

I feel like there's some cognitive dissonance here.  On the one hand,
you're saying we should use your design.  On the other hand, you are
admitting that in at least one key respect, it won't meet Thomas's
requirements.  On the third hand, you just said that you weren't
arguing for two mechanisms for sharing a BufFile across cooperating
parallel processes.  I don't see how you can hold all three of those
positions simultaneously.

>> As I've said before, I think that's an anti-goal.  This is a different
>> problem, and trying to reuse the solution we chose for the
>> non-parallel case doesn't really work.  resowner.c could end up owning
>> a shared reference count which it's responsible for decrementing --
>> and then decrementing it removes the file if the result is zero.  But
>> it can't own performing the actual unlink(), because then we can't
>> support cases where the file may have multiple readers, since whoever
>> owns the unlink() might try to zap the file out from under one of the
>> others.
>
> Define "zap the file". I think, based on your remarks here, that
> you've misunderstood my design. I think you should at least understand
> it fully if you're going to dismiss it.

zap was a colloquialism for unlink().  I concede that I don't fully
understand your design, and am trying to understand those things I do
not yet understand.

> It is true that a worker resowner can unlink() the files
> mid-unification, in the same manner as with conventional temp files,
> and not decrement its refcount in shared memory, or care at all in any
> special way. This is okay because the leader (in the case of parallel
> tuplesort) will realize that it should not "turn out the lights",
> finding that remaining reference when it calls BufFileClose() in
> its registered callback, as it alone must. It doesn't matter that the
> unlink() may have already occurred, or may be just about to occur,
> because we are only operating on already-opened files, and never on
> the link itself (we don't have to stat() the file link for example,
> which is naturally only a task for the unlink()'ing backend anyway).
> You might say that the worker only blows away the link itself, not the
> file proper, since it may still be open in leader (say).

Well, that sounds like it's counting on fd.c not to close the file
descriptor at an inconvenient point in time and reopen it later, which
is not guaranteed.

> Thomas' design cannot reliably know how many segments there are in
> workers in error paths, which necessitates his unlink()-ENOENT-ignore
> hack. My solution is that workers/owners look after their own temp
> segments in the conventional way, until they reach BufFileClose(),
> which may never come if there is an error. The only way that clean-up
> won't happen in conventional resowner.c-in-worker fashion is if
> BufFileClose() is reached in owner/worker. BufFileClose() must be
> reached when there is no error, which has to happen anyway when using
> temp files. (Else there is a temp file leak warning from resowner.c.)
>
> This is the only way to avoid the unlink()-ENOENT-ignore hack, AFAICT,
> since only the worker itself can reliably know how many segments it
> has opened at every single instant in time. Because it's the owner!

Above, you said that your design would allow for a group of processes
to share access to a file, with the last one that abandons it "turning
out the lights".  But here, you are referring to it as having one
owner -- "only the worker itself" can know the number of segments.
Those things are exact opposites of each other.

I don't think there's any problem with ignoring ENOENT, and I don't
think there's any need for a process to know the exact number of
segments in some temporary file.  In a shared-ownership environment,
that information can't be stored in a backend-private cache; it's got
to be available to whichever backend ends up being the last one out.
There are only two ways to do that.  One is to store it in shared
memory, and the other is to discover it from the filesystem.  The
former is conceptually more appealing, but it can't handle Thomas's
requirement of an unlimited number of files, so I think it makes sense
to go with the latter.  The only problem with that which I can see is
that we might orphan some temporary files if the disk is flaky and
filesystem operations are failing intermittently, but that's already a
pretty bad situation which we're not going to make much worse with
this approach.
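
For instance, cleanup that discovers segments from the filesystem can
amount to little more than the following loop (a sketch, not Thomas's
actual code):

static void
shared_buffile_delete(const char *base)
{
    int         segno;

    /* probe segment files 0, 1, 2, ... until one is missing */
    for (segno = 0;; segno++)
    {
        char        path[MAXPGPATH];

        snprintf(path, sizeof(path), "%s.%d", base, segno);
        if (unlink(path) < 0)
        {
            /* ENOENT is expected: it just means no more segments */
            if (errno != ENOENT)
                elog(LOG, "could not unlink file \"%s\": %m", path);
            break;
        }
    }
}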

>> I want the
>> last process to detach to be responsible for cleanup, regardless of
>> which process that ends up being.  I want that for lots of good
>> reasons which I have articulated including (1) it's how all other
>> resource management for parallel query already works, e.g. DSM, DSA,
>> and group locking; (2) it avoids the need for one process to sit and
>> wait until another process assumes ownership, which isn't a feature
>> even if (as you contend, and I'm not convinced) it doesn't hurt much;
>> and (3) it allows for use cases where multiple processes are reading
>> from the same shared BufFile without the risk that some other process
>> will try to unlink() the file while it's still in use.  The point for
>> me isn't so much whether unlink() ever ignores errors as whether
>> cleanup (however defined) is an operation guaranteed to happen exactly
>> once.
>
> My patch demonstrably has these properties. I've done quite a bit of
> fault injection testing to prove it.

I don't understand this comment, because 0 of the 3 properties that I
just articulated are things which can be proved or disproved by fault
injection.  Fault injection can confirm the presence of bugs or
suggest their absence, but none of those properties have to do with
whether there are bugs.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Wed, Mar 22, 2017 at 5:44 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Actually, that's quite possible with the design I came up with.
>
> I don't think it is.  What sequence of calls to the APIs you've
> proposed would accomplish that goal?  I don't see anything in this
> patch set that would permit anything other than a handoff from the
> worker to the leader.  There seems to be no way for the ref count to
> be more than 1 (or 2?).

See my remarks on this below.

>> The
>> restriction that Thomas can't live with as I've left things is that
>> you have to know the number of BufFiles ahead of time. I'm pretty sure
>> that that's all it is. (I do sympathize with the fact that that isn't
>> very helpful to him, though.)
>
> I feel like there's some cognitive dissonance here.  On the one hand,
> you're saying we should use your design.

No, I'm not. I'm saying that my design is complete on its own terms,
and has some important properties that a mechanism like this ought to
have. I think I've been pretty clear on my general uncertainty about
the broader question.

> On the other hand, you are
> admitting that in at least one key respect, it won't meet Thomas's
> requirements.  On the third hand, you just said that you weren't
> arguing for two mechanisms for sharing a BufFile across cooperating
> parallel processes.  I don't see how you can hold all three of those
> positions simultaneously.

I respect your position as the person that completely owns parallelism
here. You are correct when you say that there has to be some overlap
between the requirements for the mechanisms used by each patch --
there just *has* to be. As I said, I only know very approximately how
much overlap that is or should be, even at this late date, and I am
unfortunately not in a position to spend more time on it to find out.
C'est la vie.

I know that I have no chance of convincing you to adopt my design
here, and you are right not to accept the design, because there is a
bigger picture. And, because it's just too late now. My efforts to get
ahead of that, and anticipate and provide for Thomas' requirements
have failed. I admit that. But, you are asserting that my patch has
specific technical defects that it does not have.

I structured things this way for a reason. You are not required to
agree with me in full to see that I might have had a point. I've
described it as a trade-off already. I think that it will be of
practical value to you to see that trade-off. This insight is what
allowed me to immediately zero in on resource leak bugs in Thomas'
revision of the patch from yesterday.

>> It is true that a worker resowner can unlink() the files
>> mid-unification, in the same manner as with conventional temp files,
>> and not decrement its refcount in shared memory, or care at all in any
>> special way. This is okay because the leader (in the case of parallel
>> tuplesort) will realize that it should not "turn out the lights",
>> finding that remaining reference when it calls BufFileClose() in
>> its registered callback, as it alone must. It doesn't matter that the
>> unlink() may have already occurred, or may be just about to occur,
>> because we are only operating on already-opened files, and never on
>> the link itself (we don't have to stat() the file link for example,
>> which is naturally only a task for the unlink()'ing backend anyway).
>> You might say that the worker only blows away the link itself, not the
>> file proper, since it may still be open in leader (say).
>
> Well, that sounds like it's counting on fd.c not to close the file
> descriptor at an inconvenient point in time and reopen it later, which
> is not guaranteed.

It's true that in an error path, if the FD of the file we just opened
gets swapped out, that could happen. That seems virtually impossible,
and in any case the consequence is no worse than a confusing LOG
message. But, yes, that's a weakness.

>> This is the only way to avoid the unlink()-ENOENT-ignore hack, AFAICT,
>> since only the worker itself can reliably know how many segments it
>> has opened at every single instant in time. Because it's the owner!
>
> Above, you said that your design would allow for a group of processes
> to share access to a file, with the last one that abandons it "turning
> out the lights".  But here, you are referring to it as having one
> owner - the "only the worker itself" can know the number of segments.
> Those things are exact opposites of each other.

You misunderstood.

Under your analogy, the worker needs to wait for someone else to enter
the room before leaving, because otherwise, as an "environmentally
conscious" worker, it would be compelled to turn the lights out before
anyone else ever got to do anything with its files. But once someone
else is in the room, the worker is free to leave without turning out
the lights. I could provide a mechanism for the leader, or whatever
the other backend is, to do another hand off. You're right that that
is left unimplemented, but it would be a trivial adjunct to what I
came up with.

> I don't think there's any problem with ignoring ENOENT, and I don't
> think there's any need for a process to know the exact number of
> segments in some temporary file.

You may well be right, but that is just one detail.

>> My patch demonstrably has these properties. I've done quite a bit of
>> fault injection testing to prove it.
>
> I don't understand this comment, because 0 of the 3 properties that I
> just articulated are things which can be proved or disproved by fault
> injection.  Fault injection can confirm the presence of bugs or
> suggest their absence, but none of those properties have to do with
> whether there are bugs.

I was unclear -- I just meant (3). Specifically, that resource
ownership has been shown to be robust under stress testing/fault
injection testing.

Anyway, I will provide some feedback on Thomas' latest revision from
today, before I bow out. I owe him at least that much.

-- 
Peter Geoghegan



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Andres Freund
Date:
On 2017-02-10 07:52:57 -0500, Robert Haas wrote:
> On Thu, Feb 9, 2017 at 6:38 PM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
> > Up until two minutes ago I assumed that policy would leave only two
> > possibilities: you attach to the DSM segment and attach to the
> > SharedBufFileManager successfully or you attach to the DSM segment and
> > then die horribly (but not throw) and the postmaster restarts the
> > whole cluster and blows all temp files away with RemovePgTempFiles().
> > But I see now in the comment of that function that crash-induced
> > restarts don't call that because "someone might want to examine the
> > temp files for debugging purposes".  Given that policy for regular
> > private BufFiles, I don't see why that shouldn't apply equally to
> > shared files: after a crash restart, you may have some junk files that
> > won't be cleaned up until your next clean restart, whether they were
> > private or shared BufFiles.
> 
> I think most people (other than Tom) would agree that that policy
> isn't really sensible any more; it probably made sense when the
> PostgreSQL user community was much smaller and consisted mostly of the
> people developing PostgreSQL, but these days it's much more likely to
> cause operational headaches than to help a developer debug.

FWIW, we have restart_after_crash = false. If you need to debug things,
you can enable that. Hence the whole RemovePgTempFiles() crash-restart
exemption isn't required anymore; we have a much more targeted solution.

- Andres



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Rushabh Lathia
Date:


On Wed, Mar 22, 2017 at 3:19 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> On Wed, Mar 22, 2017 at 10:03 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Tue, Mar 21, 2017 at 3:50 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>>> I disagree with that. It is a
>>> trade-off, I suppose. I have now run out of time to work through it
>>> with you or Thomas, though.
>>
>> Bummer.
>
> I'm going to experiment with refactoring the v10 parallel CREATE INDEX
> patch to use the SharedBufFileSet interface from
> hj-shared-buf-file-v8.patch today and see what problems I run into.


As per the earlier discussion in the thread, I experimented with using
the BufFileSet interface from the parallel-hash-v18 patchset.  I took
the other parallel-hash patches as a reference to understand the
BufFileSet APIs, and incorporated the changes into parallel CREATE
INDEX.

In order to achieve the same:

- Applied 0007-Remove-BufFile-s-isTemp-flag.patch and
0008-Add-BufFileSet-for-sharing-temporary-files-between-b.patch from the
parallel-hash-v18.patchset.
- Removed the buffile.c/logtape.c/fd.c changes from the parallel CREATE
INDEX v10 patch.
- Incorporated the BufFileSet API into the parallel tuple sort for CREATE
INDEX.
- Changed a few existing functions, and added a few new ones, to support
the BufFileSet changes.

To check the performance, I used a test similar to the one Peter posted
earlier in the thread, which is:

Machine: power2 machine with 512GB of RAM

Setup:

CREATE TABLE parallel_sort_test AS
    SELECT hashint8(i) randint,
    md5(i::text) collate "C" padding1,
    md5(i::text || '2') collate "C" padding2
    FROM generate_series(0, 1e9::bigint) i;

vacuum ANALYZE parallel_sort_test;

postgres=# show max_parallel_workers_per_gather;
 max_parallel_workers_per_gather
---------------------------------
 8
(1 row)

postgres=# show maintenance_work_mem;
 maintenance_work_mem
----------------------
 8GB
(1 row)

postgres=# show max_wal_size ;
 max_wal_size
--------------
 4GB
(1 row)

CREATE INDEX serial_idx ON parallel_sort_test (randint);

Without patch:

Time: 3430054.220 ms (57:10.054)

With patch (max_parallel_workers_maintenance = 8):

Time: 1163445.271 ms (19:23.445)

Thanks to my colleague Thomas Munro for his help and off-line
discussions about the patch.

Attaching v11 patch and trace_sort output for the test.
 
Thanks,
Rushabh Lathia

Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Tue, Sep 19, 2017 at 3:21 AM, Rushabh Lathia
<rushabh.lathia@gmail.com> wrote:
> As per the earlier discussion in the thread, I experimented with using
> the BufFileSet interface from the parallel-hash-v18 patchset.  I took
> the other parallel-hash patches as a reference to understand the
> BufFileSet APIs, and incorporated the changes into parallel CREATE
> INDEX.
>
> In order to achieve the same:
>
> - Applied 0007-Remove-BufFile-s-isTemp-flag.patch and
> 0008-Add-BufFileSet-for-sharing-temporary-files-between-b.patch from the
> parallel-hash-v18.patchset.
> - Removed the buffile.c/logtape.c/fd.c changes from the parallel CREATE
> INDEX v10 patch.
> - Incorporated the BufFileSet API into the parallel tuple sort for CREATE
> INDEX.
> - Changed a few existing functions, and added a few new ones, to support
> the BufFileSet changes.

I'm glad that somebody is working on this. (Someone closer to the more
general work on shared/parallel BufFile infrastructure than I am.)

I do have some quick feedback, and I hope to be able to provide that
to both you and Thomas, as needed to see this one through. I'm not
going to get into the tricky details around resource management just
yet. I'll start with some simpler questions, to get a general sense of
the plan here.

I gather that you're at least aware that your v11 of the patch doesn't
preserve randomAccess support for parallel sorts, because you didn't
include my 0002-* testing GUCs patch, which was specifically designed
to make various randomAccess stuff testable. I also figured this to be
true because I noticed this FIXME among (otherwise unchanged)
tuplesort code:

> +static void
> +leader_takeover_tapes(Tuplesortstate *state)
> +{
> +   Sharedsort *shared = state->shared;
> +   int         nLaunched = state->nLaunched;
> +   int         j;
> +
> +   Assert(LEADER(state));
> +   Assert(nLaunched >= 1);
> +   Assert(nLaunched == shared->workersFinished);
> +
> +   /*
> +    * Create the tapeset from worker tapes, including a leader-owned tape at
> +    * the end.  Parallel workers are far more expensive than logical tapes,
> +    * so the number of tapes allocated here should never be excessive. FIXME
> +    */
> +   inittapestate(state, nLaunched + 1);
> +   state->tapeset = LogicalTapeSetCreate(nLaunched + 1, shared->tapes,
> +                                         state->fileset, state->worker);

It's not surprising to me that you do not yet have this part working,
because much of my design was about changing as little as possible
above the BufFile interface, in order for tuplesort.c (and logtape.c)
code like this to "just work" as if it was the serial case. It doesn't
look like you've added the kind of BufFile multiplexing code that I
expected to see in logtape.c. This is needed to compensate for the
code removed from fd.c and buffile.c. Perhaps it would help me to go
look at Thomas' latest parallel hash join patch -- did it gain some
kind of transparent multiplexing ability that you actually (want to)
use here?

Though randomAccess isn't used by CREATE INDEX in general, and so not
supporting randomAccess within tuplesort.c for parallel callers
doesn't matter as far as this CREATE INDEX user-visible feature is
concerned, I still believe that randomAccess is important (IIRC,
Robert thought so too). Specifically, it seems like a good idea to
have randomAccess support, both on general principle (why should the
parallel case be different?), and because having it now will probably
enable future enhancements to logtape.c. Enhancements that have it
manage parallel sorts based on partitioning/distribution/bucketing
[1]. I'm pretty sure that partitioning-based parallel sort is going to
become very important in the future, especially for parallel
GroupAggregate. The leader needs to truly own the tapes it reclaims
from workers in order for all of this to work.

Questions on where you're going with randomAccess support:

1. Is randomAccess support a goal for you here at all?

2. If so, is preserving eager recycling of temp file space during
randomAccess (materializing a final output tape within the leader)
another goal for you here? Do we need to preserve that property of
serial external sorts, too, so that it remains true that logtape.c
ensures that "the total space usage is essentially just the actual
data volume, plus insignificant bookkeeping and start/stop overhead"?
(I'm quoting from master's logtape.c header comments.)

3. Any ideas on next steps in support of those 2 goals? What problems
do you foresee, if any?

> CREATE INDEX serial_idx ON parallel_sort_test (randint);
>
> Without patch:
>
> Time: 3430054.220 ms (57:10.054)
>
> With patch (max_parallel_workers_maintenance  = 8):
>
> Time: 1163445.271 ms (19:23.445)

This looks very similar to my v10. While I will need to follow up on
this, to make sure, it seems likely that this patch has exactly the
same performance characteristics as v10.

Thanks

[1]
https://wiki.postgresql.org/wiki/Parallel_External_Sort#Partitioning_for_parallelism_.28parallel_external_sort_beyond_CREATE_INDEX.29
-- 
Peter Geoghegan



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Rushabh Lathia
Date:


On Wed, Sep 20, 2017 at 5:17 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Tue, Sep 19, 2017 at 3:21 AM, Rushabh Lathia
> <rushabh.lathia@gmail.com> wrote:
>> As per the earlier discussion in the thread, I experimented with using
>> the BufFileSet interface from the parallel-hash-v18 patchset.  I took
>> the other parallel-hash patches as a reference to understand the
>> BufFileSet APIs, and incorporated the changes into parallel CREATE
>> INDEX.
>>
>> In order to achieve the same:
>>
>> - Applied 0007-Remove-BufFile-s-isTemp-flag.patch and
>> 0008-Add-BufFileSet-for-sharing-temporary-files-between-b.patch from the
>> parallel-hash-v18.patchset.
>> - Removed the buffile.c/logtape.c/fd.c changes from the parallel CREATE
>> INDEX v10 patch.
>> - Incorporated the BufFileSet API into the parallel tuple sort for CREATE
>> INDEX.
>> - Changed a few existing functions, and added a few new ones, to support
>> the BufFileSet changes.
>
> I'm glad that somebody is working on this. (Someone closer to the more
> general work on shared/parallel BufFile infrastructure than I am.)
>
> I do have some quick feedback, and I hope to be able to provide that
> to both you and Thomas, as needed to see this one through. I'm not
> going to get into the tricky details around resource management just
> yet. I'll start with some simpler questions, to get a general sense of
> the plan here.

Thanks, Peter.

> I gather that you're at least aware that your v11 of the patch doesn't
> preserve randomAccess support for parallel sorts, because you didn't
> include my 0002-* testing GUCs patch, which was specifically designed
> to make various randomAccess stuff testable. I also figured this to be
> true because I noticed this FIXME among (otherwise unchanged)
> tuplesort code:

Yes, I haven't touched the randomAccess part yet. My initial goal was
to incorporate the BufFileSet APIs here.

>> +static void
>> +leader_takeover_tapes(Tuplesortstate *state)
>> +{
>> +   Sharedsort *shared = state->shared;
>> +   int         nLaunched = state->nLaunched;
>> +   int         j;
>> +
>> +   Assert(LEADER(state));
>> +   Assert(nLaunched >= 1);
>> +   Assert(nLaunched == shared->workersFinished);
>> +
>> +   /*
>> +    * Create the tapeset from worker tapes, including a leader-owned tape at
>> +    * the end.  Parallel workers are far more expensive than logical tapes,
>> +    * so the number of tapes allocated here should never be excessive. FIXME
>> +    */
>> +   inittapestate(state, nLaunched + 1);
>> +   state->tapeset = LogicalTapeSetCreate(nLaunched + 1, shared->tapes,
>> +                                         state->fileset, state->worker);
>
> It's not surprising to me that you do not yet have this part working,
> because much of my design was about changing as little as possible
> above the BufFile interface, in order for tuplesort.c (and logtape.c)
> code like this to "just work" as if it was the serial case.

Right. I just followed your design from your earlier patches.

> It doesn't
> look like you've added the kind of BufFile multiplexing code that I
> expected to see in logtape.c. This is needed to compensate for the
> code removed from fd.c and buffile.c. Perhaps it would help me to go
> look at Thomas' latest parallel hash join patch -- did it gain some
> kind of transparent multiplexing ability that you actually (want to)
> use here?

Sorry, I didn't get this part. Are you talking about your patch's
changes to OpenTemporaryFileInTablespace(), BufFileUnify(), and other
changes related to ltsUnify()?  If so, I don't think that is required
with the BufFileSet. Correct me if I am wrong here.

> Though randomAccess isn't used by CREATE INDEX in general, and so not
> supporting randomAccess within tuplesort.c for parallel callers
> doesn't matter as far as this CREATE INDEX user-visible feature is
> concerned, I still believe that randomAccess is important (IIRC,
> Robert thought so too). Specifically, it seems like a good idea to
> have randomAccess support, both on general principle (why should the
> parallel case be different?), and because having it now will probably
> enable future enhancements to logtape.c. Enhancements that have it
> manage parallel sorts based on partitioning/distribution/bucketing
> [1]. I'm pretty sure that partitioning-based parallel sort is going to
> become very important in the future, especially for parallel
> GroupAggregate. The leader needs to truly own the tapes it reclaims
> from workers in order for all of this to work.

The first application for the tuplesort here is CREATE INDEX, and that
doesn't need randomAccess. But as you said, and as has been discussed
in the thread, randomAccess is important, and we should certainly put
in the effort to support it.

> Questions on where you're going with randomAccess support:
>
> 1. Is randomAccess support a goal for you here at all?
>
> 2. If so, is preserving eager recycling of temp file space during
> randomAccess (materializing a final output tape within the leader)
> another goal for you here? Do we need to preserve that property of
> serial external sorts, too, so that it remains true that logtape.c
> ensures that "the total space usage is essentially just the actual
> data volume, plus insignificant bookkeeping and start/stop overhead"?
> (I'm quoting from master's logtape.c header comments.)
>
> 3. Any ideas on next steps in support of those 2 goals? What problems
> do you foresee, if any?

To be frank, it's too early for me to comment on anything in this
area.  I need to study this more closely. As an initial goal I was
just focused on understanding the current implementation of the patch
and incorporating the BufFileSet APIs.

>> CREATE INDEX serial_idx ON parallel_sort_test (randint);
>>
>> Without patch:
>>
>> Time: 3430054.220 ms (57:10.054)
>>
>> With patch (max_parallel_workers_maintenance = 8):
>>
>> Time: 1163445.271 ms (19:23.445)
>
> This looks very similar to my v10. While I will need to follow up on
> this, to make sure, it seems likely that this patch has exactly the
> same performance characteristics as v10.

It's 2.96x, more or less similar to your v10.  Any difference might be
due to the different testing environment.

> Thanks
>
> [1] https://wiki.postgresql.org/wiki/Parallel_External_Sort#Partitioning_for_parallelism_.28parallel_external_sort_beyond_CREATE_INDEX.29
> --
> Peter Geoghegan


Thanks,
Rushabh Lathia

Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Robert Haas
Date:
On Wed, Sep 20, 2017 at 5:32 AM, Rushabh Lathia
<rushabh.lathia@gmail.com> wrote:
> First application for the tuplesort here is CREATE INDEX and that doesn't
> need randomAccess. But as you said and in the thread its been discussed,
> randomAccess is an important and we should sure put an efforts to support
> the same.

There's no direct benefit to working on randomAccess support unless we
have some code that wants to use that support for something.  Indeed,
it would just leave us with code we couldn't test.

While I do agree that there are probably use cases for randomAccess, I
think what we should do right now is try to get this patch reviewed
and committed so that we have parallel CREATE INDEX for btree indexes.
And in so doing, let's keep it as simple as possible.  Parallel CREATE
INDEX for btree indexes is a great feature without adding any more
complexity.

Later, anybody who wants to work on randomAccess support -- and
whatever planner and executor changes are needed to make effective use
of it -- can do so.  For example, one can imagine a plan like this:

Gather
-> Merge Join
  -> Parallel Index Scan
  -> Parallel Sort
    -> Parallel Seq Scan

If the parallel sort reads out all of the output in every worker, then
it becomes legal to do this kind of thing -- it would end up, I think,
being quite similar to Parallel Hash.  However, there's some question
in my mind as to whether we want to do this or, say, hash-partition both
relations and then perform separate joins on each partition.  The
above plan is clearly better than what we can do today, where every
worker would have to repeat the sort, ugh, but I don't know if it's
the best plan.  Fortunately, to get this patch committed, we don't
have to figure that out.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Wed, Sep 20, 2017 at 2:32 AM, Rushabh Lathia
<rushabh.lathia@gmail.com> wrote:
> Yes, I haven't touched the randomAccess part yet. My initial goal was
> to incorporate the BufFileSet APIs here.

This is going to need a rebase, due to the commit today to remove
replacement selection sort. That much should be easy.

> Sorry, I didn't get this part. Are you talking about your patch's
> changes to OpenTemporaryFileInTablespace(), BufFileUnify(), and other
> changes related to ltsUnify()?  If so, I don't think that is required
> with the BufFileSet. Correct me if I am wrong here.

I thought that you'd have multiple BufFiles, which would be
multiplexed (much like a single BufFile itself multiplexes 1GB
segments), so that logtape.c could still recycle space in the
randomAccess case. I guess that that's not a goal now.

> To be frank, it's too early for me to comment on anything in this
> area.  I need to study this more closely. As an initial goal I was
> just focused on understanding the current implementation of the patch
> and incorporating the BufFileSet APIs.

Fair enough.

-- 
Peter Geoghegan



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Rushabh Lathia
Date:


On Sat, Sep 30, 2017 at 5:06 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Wed, Sep 20, 2017 at 2:32 AM, Rushabh Lathia
> <rushabh.lathia@gmail.com> wrote:
>> Yes, I haven't touched the randomAccess part yet. My initial goal was
>> to incorporate the BufFileSet APIs here.
>
> This is going to need a rebase, due to the commit today to remove
> replacement selection sort. That much should be easy.

Sorry for the delay; here is the rebased version of the patch.

>> Sorry, I didn't get this part. Are you talking about your patch's
>> changes to OpenTemporaryFileInTablespace(), BufFileUnify(), and other
>> changes related to ltsUnify()?  If so, I don't think that is required
>> with the BufFileSet. Correct me if I am wrong here.
>
> I thought that you'd have multiple BufFiles, which would be
> multiplexed (much like a single BufFile itself multiplexes 1GB
> segments), so that logtape.c could still recycle space in the
> randomAccess case. I guess that that's not a goal now.

Hmm, okay.

>> To be frank, it's too early for me to comment on anything in this
>> area.  I need to study this more closely. As an initial goal I was
>> just focused on understanding the current implementation of the patch
>> and incorporating the BufFileSet APIs.
>
> Fair enough.

Thanks,

--
Rushabh Lathia

Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Rushabh Lathia
Date:
Attaching the rebased patch, updated according to the v22 parallel-hash patch set.

Thanks

On Tue, Oct 10, 2017 at 2:53 PM, Rushabh Lathia <rushabh.lathia@gmail.com> wrote:
> Sorry for the delay; here is the rebased version of the patch.

--
Rushabh Lathia

Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Thu, Oct 26, 2017 at 4:22 AM, Rushabh Lathia
<rushabh.lathia@gmail.com> wrote:
> Attaching the rebased patch, updated according to the v22 parallel-hash patch set.

I took a quick look at this today, and noticed a few issues:

* make_name() is used to name files in sharedtuplestore.c, which is
what is passed to BufFileOpenShared() for parallel hash join. You're
using your own logic for that within the equivalent logtape.c call to
BufFileOpenShared(), presumably because make_name() wants to identify
participants by PID rather than by an ordinal identifier number.

I think that we need some kind of central registry for things that use
shared buffiles. It could be that sharedtuplestore.c is further
generalized to support this, or it could be that they both call
something else that takes care of naming. It's not okay to have this
left to random chance.

You're going to have to ask Thomas about this. You should also use
MAXPGPATH for the char buffer on the stack.

* This logtape.c comment needs to be updated, as it's no longer true:

 * successfully.  In general, workers can take it that the leader will
 * reclaim space in files under their ownership, and so should not
 * reread from tape.

* Robert hated the comment changes in the header of nbtsort.c. You
might want to change them back, because he is likely to be the one that
commits this.

* You should look for similar comments in tuplesort.c (IIRC a couple
of places will need to be revised).

* tuplesort_begin_common() should actively reject a randomAccess
parallel case using elog(ERROR).

* tuplesort.h should note that randomAccess isn't supported, too.

* What's this all about?:

+ /* Accessor for the SharedBufFileSet that is at the end of Sharedsort. */
+ #define GetSharedBufFileSet(shared)                    \
+   ((BufFileSet *) (&(shared)->tapes[(shared)->nTapes]))

You can't just cast from one type to the other without regard for the
underlying size of the shared memory buffer, which is what this looks
like to me. This only fails to crash because you're only abusing the
last member in the tapes array for this purpose, and there happens to
be enough shared memory slop that you get away with it. I'm pretty
sure that ltsUnify() ends up clobbering the last/leader tape, which is
a place where BufFileSet is also used, so this is just wrong. You
should rethink the shmem structure a little bit (see the sketch at the
end of this list of review points).

* There is still that FIXME comment within leader_takeover_tapes(). I
believe that you should still have a leader tape (at least in local
memory in the leader), even though you'll never be able to do anything
with it, since randomAccess is no longer supported. You can remove the
FIXME, and just note that you have a leader tape to be consistent with
the serial case, though recognize that it's not useful. Note that even
with randomAccess, we always had the leader tape, so it's not that
different, really.

I suppose it might make sense to make shared->tapes not have a leader
tape. It hardly matters -- perhaps you should leave it there in order
to keep the code simple, as you'll be keeping the leader tape in local
memory, too. (But it still won't fly to continue to clobber it, of
course -- you still need to find a dedicated place for BufFileSet in
shared memory.)
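
To make the GetSharedBufFileSet() point above concrete, here is one
way to size and address the shared memory properly. It's only a
sketch: Sharedsort and TapeShare are the patch's names as I understand
them, the rest is made up, and it assumes BufFileSet is fixed-size (if
it ends in a flexible array, that size must be added too):

static Size
sharedsort_estimate(int nTapes)
{
    Size        size;

    size = offsetof(Sharedsort, tapes);
    size = add_size(size, mul_size(nTapes, sizeof(TapeShare)));
    size = MAXALIGN(size);      /* keep BufFileSet suitably aligned */
    size = add_size(size, sizeof(BufFileSet));
    return size;
}

/* locate the set at an explicit offset instead of overlaying tapes[] */
#define GetSharedBufFileSet(shared) \
    ((BufFileSet *) ((char *) (shared) + \
        MAXALIGN(offsetof(Sharedsort, tapes) + \
                 (shared)->nTapes * sizeof(TapeShare))))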

That's all I have right now.
-- 
Peter Geoghegan



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Thomas Munro
Date:
On Wed, Nov 1, 2017 at 11:29 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Thu, Oct 26, 2017 at 4:22 AM, Rushabh Lathia
> <rushabh.lathia@gmail.com> wrote:
>> Attaching the re based patch according to the v22 parallel-hash patch sets
>
> I took a quick look at this today, and noticed a few issues:
>
> * make_name() is used to name files in sharedtuplestore.c, which is
> what is passed to BufFileOpenShared() for parallel hash join. You're
> using your own logic for that within the equivalent logtape.c call to
> BufFileOpenShared(), presumably because make_name() wants to identify
> participants by PID rather than by an ordinal identifier number.

So that's this bit:

+ pg_itoa(worker, filename);
+ lts->pfile = BufFileCreateShared(fileset, filename);

... and:

+ pg_itoa(i, filename);
+ file = BufFileOpenShared(fileset, filename);

What's wrong with using a worker number like this?

> I think that we need some kind of central registry for things that use
> shared buffiles. It could be that sharedtuplestore.c is further
> generalized to support this, or it could be that they both call
> something else that takes care of naming. It's not okay to have this
> left to random chance.

It's not random choice: buffile.c creates a uniquely named directory
(or directories, if you have more than one location configured in the
temp_tablespaces GUC) to hold all the backing files involved in each
BufFileSet.  Naming of BufFiles within the BufFileSet is the caller's
problem, and a worker number seems like a reasonable choice to me.  It
won't collide with a concurrent parallel CREATE INDEX because that'll
be using its own BufFileSet.

> You're going to have to ask Thomas about this.  You should also use
> MAXPGPATH for the char buffer on the stack.

Here's a summary of namespace management scheme I currently have at
the three layers fd.c, buffile.c, sharedtuplestore.c:

1.  fd.c provides new lower-level functions
PathNameCreateTemporaryFile(const char *path) and
PathNameOpenTemporaryFile(const char *path).  It also provides
PathNameCreateTemporaryDir().  Clearly callers of these interfaces
will need to be very careful about managing the names they use.
Callers also own the problem of cleaning up files, since there is no
automatic cleanup of files created this way.  My intention was that
these facilities would *only* be used by BufFileSet, since it has
machinery to manage those things.

2.  buffile.c introduces BufFileSet, which is conceptually a set of
BufFiles that can be shared by multiple backends with DSM
segment-scoped cleanup.  It is implemented as a set of directories:
one for each tablespace in temp_tablespaces.  It controls the naming
of those directories.  The BufFileSet directories are named similarly
to fd.c's traditional temporary file names using the usual recipe
"pgsql_tmp" + PID + per-process counter but have an additional ".set"
suffix.  RemovePgTempFilesInDir() recognises directories with that
prefix and suffix as junk left over from a crash when cleaning up.  I
suppose it's that knowledge about reserved name patterns and cleanup
that you are thinking of as a central registry?  As for the BufFiles
that are in a BufFileSet, buffile.c has no opinion on that: the
calling code (parallel CREATE INDEX, sharedtuplestore.c, ...) is
responsible for coming up with its own scheme.  If parallel CREATE
INDEX wants to name shared BufFiles "walrus" and "banana", that's OK
by me, and those files won't collide with anything in another
BufFileSet because each BufFileSet has its own directory (-ies).
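To make the recipe concrete, the set directory name is composed along
these lines (a simplified sketch; tempdirpath and set_counter stand in
for the real bookkeeping in buffile.c):

/* "pgsql_tmp" + creating PID + per-process counter + ".set" */
char        setdirpath[MAXPGPATH];

snprintf(setdirpath, sizeof(setdirpath), "%s/%s%d.%u.set",
         tempdirpath, PG_TEMP_FILE_PREFIX, (int) MyProcPid, set_counter++);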

One complaint about the current coding that someone might object to:
MakeSharedSegmentPath() just dumps the caller's BufFile name into a
path without sanitisation: I should fix that so that we only accept
fairly limited strings here.  Another complaint is that perhaps fd.c
knows too much about buffile.c's business.  For example,
RemovePgTempFilesInDir() knows about the ".set" directories created by
buffile.c, which might be called a layering violation.  Perhaps the
set/directory logic should move entirely into fd.c, so you'd call
FileSetInit(FileSet *), not BufFileSetInit(BufFileSet *), and then
BufFileOpenShared() would take a FileSet *, not a BufFileSet *.
Thoughts?

3.  sharedtuplestore.c takes a caller-supplied BufFileSet and creates
its shared BufFiles in there.  Earlier versions created and owned a
BufFileSet, but in the current Parallel Hash patch I create loads of
separate SharedTuplestore objects but I didn't want to create load of
directories to back them, so you can give them all the same
BufFileSet.  That works because SharedTuplestores are also given a
name, and it's the caller's job (in my case nodeHash.c) to make sure
the SharedTuplestores are given unique names within the same
BufFileSet.  For Parallel Hash you'll see names like 'i3of8' (inner
batch 3 of 8).  There is no need for any sort of
central registry for that though, because it rides on top of the
guarantees from 2 above: buffile.c will put those files into a
uniquely named directory, and that works as long as no one else is
allowed to create files or directories in the temp directory that
collide with its reserved pattern /^pgsql_tmp.+\.set$/.  For the same
reason, parallel CREATE INDEX is free to use worker numbers as BufFile
names, since it has its own BufFileSet to work within.
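For illustration, such a batch name can be composed like this (sketch;
batchno and nbatch are assumed local variables):

/* Compose a SharedTuplestore name that is unique within one BufFileSet,
 * following the "i3of8" (inner batch 3 of 8) convention. */
char        name[MAXPGPATH];

snprintf(name, sizeof(name), "i%dof%d", batchno, nbatch);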

> * What's this all about?:
>
> + /* Accessor for the SharedBufFileSet that is at the end of Sharedsort. */
> + #define GetSharedBufFileSet(shared)                    \
> +   ((BufFileSet *) (&(shared)->tapes[(shared)->nTapes]))

In an earlier version, BufFileSet was one of those annoying data
structures with a FLEXIBLE_ARRAY_MEMBER that you'd use as an
incomplete type (declared but not defined in the includable header),
and here it was being used "inside" (or rather after) SharedSort,
which *itself* had a FLEXIBLE_ARRAY_MEMBER.  The reason for the
variable sized object was that I needed all backends to agree on the
set of temporary tablespace OIDs, of which there could be any number,
but I also needed a 'flat' (pointer-free) object I could stick in
relocatable shared memory.  In the newest version I changed that
flexible array to tablespaces[8], because 8 should be enough
tablespaces for anyone (TM).  I don't really believe anyone uses
temp_tablespaces for IO load balancing anymore and I hate code like
the above.  So I think Rushabh should now remove the above-quoted code
and just use a BufFileSet directly as a member of SharedSort.
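In outline, that would look something like this (a sketch; the
surrounding fields are placeholders, only the placement matters):

/* BufFileSet as a plain member, so nothing overlays the tapes array. */
typedef struct Sharedsort
{
    /* ... mutex, worker-progress counters, etc. ... */
    BufFileSet  fileset;    /* fixed size now, so it can live inline */
    int         nTapes;
    TapeShare   tapes[FLEXIBLE_ARRAY_MEMBER];   /* must remain last */
} Sharedsort;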

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Tue, Oct 31, 2017 at 5:07 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> So that's this bit:
>
> + pg_itoa(worker, filename);
> + lts->pfile = BufFileCreateShared(fileset, filename);
>
> ... and:
>
> + pg_itoa(i, filename);
> + file = BufFileOpenShared(fileset, filename);

Right.

> What's wrong with using a worker number like this?

I guess nothing, though there is the question of discoverability for
DBAs, etc. You do address this separately, by having (potentially)
descriptive filenames, as you discuss below.

> It's not random choice: buffile.c creates a uniquely named directory
> (or directories, if you have more than one location configured in the
> temp_tablespaces GUC) to hold all the backing files involved in each
> BufFileSet.  Naming of BufFiles within the BufFileSet is the caller's
> problem, and a worker number seems like a reasonable choice to me.  It
> won't collide with a concurrent parallel CREATE INDEX because that'll
> be using its own BufFileSet.

Oh, I see. I may have jumped the gun on that one.

> One complaint about the current coding that someone might object to:
> MakeSharedSegmentPath() just dumps the caller's BufFile name into a
> path without sanitisation: I should fix that so that we only accept
> fairly limited strings here.  Another complaint is that perhaps fd.c
> knows too much about buffile.c's business.  For example,
> RemovePgTempFilesInDir() knows about the ".set" directories created by
> buffile.c, which might be called a layering violation.  Perhaps the
> set/directory logic should move entirely into fd.c, so you'd call
> FileSetInit(FileSet *), not BufFileSetInit(BufFileSet *), and then
> BufFileOpenShared() would take a FileSet *, not a BufFileSet *.
> Thoughts?

I'm going to make an item on my personal TODO list for that. No useful
insights on that right now, though.

> 3.  sharedtuplestore.c takes a caller-supplied BufFileSet and creates
> its shared BufFiles in there.  Earlier versions created and owned a
> BufFileSet, but in the current Parallel Hash patch I create loads of
> separate SharedTuplestore objects but I didn't want to create load of
> directories to back them, so you can give them all the same
> BufFileSet.  That works because SharedTuplestores are also given a
> name, and it's the caller's job (in my case nodeHash.c) to make sure
> the SharedTuplestores are given unique names within the same
> BufFileSet.  For Parallel Hash you'll see names like 'i3of8' (inner
> batch 3 of 8).  There is no need for any sort of
> central registry for that though, because it rides on top of the
> guarantees from 2 above: buffile.c will put those files into a
> uniquely named directory, and that works as long as no one else is
> allowed to create files or directories in the temp directory that
> collide with its reserved pattern /^pgsql_tmp.+\.set$/.  For the same
> reason, parallel CREATE INDEX is free to use worker numbers as BufFile
> names, since it has its own BufFileSet to work within.

If the new standard is that you have temp file names that suggest the
purpose of each temp file, then that may be something that parallel
CREATE INDEX should buy into.

> In an earlier version, BufFileSet was one of those annoying data
> structures with a FLEXIBLE_ARRAY_MEMBER that you'd use as an
> incomplete type (declared but not defined in the includable header),
> and here it was being used "inside" (or rather after) SharedSort,
> which *itself* had a FLEXIBLE_ARRAY_MEMBER.  The reason for the
> variable sized object was that I needed all backends to agree on the
> set of temporary tablespace OIDs, of which there could be any number,
> but I also needed a 'flat' (pointer-free) object I could stick in
> relocatable shared memory.  In the newest version I changed that
> flexible array to tablespaces[8], because 8 should be enough
> tablespaces for anyone (TM).

I guess that that's something that you'll need to take up with Andres,
if you haven't already. I have a hard time imagining a single query
needing to use more than that many tablespaces at once, so maybe this
is fine.

> I don't really believe anyone uses
> temp_tablespaces for IO load balancing anymore and I hate code like
> the above.  So I think Rushabh should now remove the above-quoted code
> and just use a BufFileSet directly as a member of SharedSort.

FWIW, I agree with you that nobody uses temp_tablespaces this way
these days. This seems like a discussion for your hash join patch,
though. I'm happy to buy into that.

-- 
Peter Geoghegan



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Thomas Munro
Date:
On Wed, Nov 1, 2017 at 2:11 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Tue, Oct 31, 2017 at 5:07 PM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> Another complaint is that perhaps fd.c
>> knows too much about buffile.c's business.  For example,
>> RemovePgTempFilesInDir() knows about the ".set" directories created by
>> buffile.c, which might be called a layering violation.  Perhaps the
>> set/directory logic should move entirely into fd.c, so you'd call
>> FileSetInit(FileSet *), not BufFileSetInit(BufFileSet *), and then
>> BufFileOpenShared() would take a FileSet *, not a BufFileSet *.
>> Thoughts?
>
> I'm going to make an item on my personal TODO list for that. No useful
> insights on that right now, though.

I decided to try that, but it didn't really work: fd.h gets included
by front-end code, so I can't very well define a struct and declare
functions that deal in dsm_segment and slock_t.  On the other hand it
does seem a bit better for these shared file sets to work in terms
of File, not BufFile.  That way you don't have to opt in to BufFile's
double buffering and segmentation schemes just to get shared file
clean-up, if for some reason you want direct file handles.  So in
the v24 parallel hash patch set I just posted over in the other thread
I have moved it into its own translation unit sharedfileset.c and made
it work with File objects.  buffile.c knows how to use it as a source
of segment files.  I think that's better.
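The resulting usage is roughly this (sketch, following the names
described above):

/* sharedfileset.c owns naming and DSM-scoped cleanup in terms of File;
 * buffile.c layers shared BufFiles on top of it. */
SharedFileSet fileset;
BufFile    *file;

SharedFileSetInit(&fileset, seg);           /* seg is the dsm_segment */
file = BufFileCreateShared(&fileset, "0");  /* e.g. worker 0's tape */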

> If the new standard is that you have temp file names that suggest the
> purpose of each temp file, then that may be something that parallel
> CREATE INDEX should buy into.

Yeah, I guess that could be useful.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>> I'm going to make an item on my personal TODO list for that. No useful
>> insights on that right now, though.
>
>I decided to try that, but it didn't really work: fd.h gets included
>by front-end code, so I can't very well define a struct and declare
>functions that deal in dsm_segment and slock_t.  On the other hand it
>does seem a bit better for these shared file sets to work in terms
>of File, not BufFile.

Realistically, fd.h has a number of functions that are really owned by
buffile.c already. This sounds fine.

> That way you don't have to opt in to BufFile's
>double buffering and segmentation schemes just to get shared file
>clean-up, if for some reason you want direct file handles.

Is that something that you really think is possible?

-- 
Peter Geoghegan



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Thomas Munro
Date:
On Fri, Nov 3, 2017 at 2:24 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>> That way you don't have to opt in to BufFile's
>> double buffering and segmentation schemes just to get shared file
>> clean-up, if for some reason you want direct file handles.
>
> Is that something that you really think is possible?

It's pretty far-fetched, but maybe shared temporary relation files
accessed via smgr.c/md.c?  Or maybe future things that don't want to
read/write through a buffer but instead want to mmap it.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Rushabh Lathia
Date:
Thanks Peter and Thomas for the review comments.


On Wed, Nov 1, 2017 at 3:59 AM, Peter Geoghegan <pg@bowt.ie> wrote:
On Thu, Oct 26, 2017 at 4:22 AM, Rushabh Lathia
<rushabh.lathia@gmail.com> wrote:
> Attaching the rebased patch according to the v22 parallel-hash patch sets

I took a quick look at this today, and noticed a few issues:

* make_name() is used to name files in sharedtuplestore.c, which is
what is passed to BufFileOpenShared() for parallel hash join. You're
using your own logic for that within the equivalent logtape.c call to
BufFileOpenShared(), presumably because make_name() wants to identify
participants by PID rather than by an ordinal identifier number.

I think that we need some kind of central registry for things that use
shared buffiles. It could be that sharedtuplestore.c is further
generalized to support this, or it could be that they both call
something else that takes care of naming. It's not okay to have this
left to random chance.

You're going to have to ask Thomas about this. You should also use
MAXPGPATH for the char buffer on the stack.


Used MAXPGPATH for the char buffer.
 
* This logtape.c comment needs to be updated, as it's no longer true:
 
 * successfully.  In general, workers can take it that the leader will
 * reclaim space in files under their ownership, and so should not
 * reread from tape.


Done.
 
* Robert hated the comment changes in the header of nbtsort.c. You
might want to change it back, because he is likely to be the one that
commits this.
 
* You should look for similar comments in tuplesort.c (IIRC a couple
of places will need to be revised).


Pending.
 
* tuplesort_begin_common() should actively reject a randomAccess
parallel case using elog(ERROR).


Done.
 
* tuplesort.h should note that randomAccess isn't supported, too.


Done.
 
* What's this all about?:

+ /* Accessor for the SharedBufFileSet that is at the end of Sharedsort. */
+ #define GetSharedBufFileSet(shared)                    \
+   ((BufFileSet *) (&(shared)->tapes[(shared)->nTapes]))

You can't just cast from one type to the other without regard for the
underlying size of the shared memory buffer, which is what this looks
like to me. This only fails to crash because you're only abusing the
last member in the tapes array for this purpose, and there happens to
be enough shared memory slop that you get away with it. I'm pretty
sure that ltsUnify() ends up clobbering the last/leader tape, which is
a place where BufFileSet is also used, so this is just wrong. You
should rethink the shmem structure a little bit.


Fixed this by adding a SharedFileSet directly into the Sharedsort struct.

Thanks Thomas Munro for the offline help here.

* There is still that FIXME comment within leader_takeover_tapes(). I
believe that you should still have a leader tape (at least in local
memory in the leader), even though you'll never be able to do anything
with it, since randomAccess is no longer supported. You can remove the
FIXME, and just note that you have a leader tape to be consistent with
the serial case, though recognize that it's not useful. Note that even
with randomAccess, we always had the leader tape, so it's not that
different, really.


Done.
 
I suppose it might make sense to make shared->tapes not have a leader
tape. It hardly matters -- perhaps you should leave it there in order
to keep the code simple, as you'll be keeping the leader tape in local
memory, too. (But it still won't fly to continue to clobber it, of
course -- you still need to find a dedicated place for BufFileSet in
shared memory.)


Attaching the latest patch (v13) here; I will continue working on the comment
improvements for nbtsort.c and tuplesort.c.  I will also perform more testing
with the attached patch.

The patch is rebased on the v25 patch set of Parallel Hash.


Thanks,
Rushabh Lathia
Attachment

Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Tue, Nov 14, 2017 at 1:41 AM, Rushabh Lathia
<rushabh.lathia@gmail.com> wrote:
> Thanks Peter and Thomas for the review comments.

No problem. More feedback:

* I don't really see much need for this:

+ elog(LOG, "Worker for create index %d", parallel_workers);

You can just use trace_sort, and observe the actual behavior of the
sort that way.

* As I said before, you should remove the header comments within nbtsort.c.

* This should just say "write routines":

+ * This is why write/recycle routines don't need to know about offsets at
+ * all.

* You didn't point out the randomAccess restriction in tuplesort.h.

* I can't remember why I added the Valgrind suppression at this point.
I'd remove it until the reason becomes clear, which may never happen.
The regression tests should still pass without Valgrind warnings.

* You can add back comments removed from above LogicalTapeTell(). I
made these changes because it looked like we should close out the
possibility of doing a tell during the write phase, as unified tapes
actually would make that hard (no one does what it describes anyway).
But now, unified tapes are a distinct case to frozen tapes in a way
that they weren't before, so there is no need to make it impossible.

I also think you should replace "Assert(lt->frozen)" with
"Assert(lt->offsetBlockNumber == 0L)", for the same reason.

-- 
Peter Geoghegan


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Tue, Nov 14, 2017 at 10:01 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Tue, Nov 14, 2017 at 1:41 AM, Rushabh Lathia
> <rushabh.lathia@gmail.com> wrote:
>> Thanks Peter and Thomas for the review comments.
>
> No problem. More feedback:

I see that Robert just committed support for a
parallel_leader_participation GUC. Parallel tuplesort should use this,
too.

It will be easy to adapt the patch to make this work. Just change the
code within nbtsort.c to respect parallel_leader_participation, rather
than leaving that as a compile-time switch. Remove the
force_single_worker variable, and use !parallel_leader_participation
in its place.

The parallel_leader_participation docs will also need to be updated.

-- 
Peter Geoghegan


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Rushabh Lathia
Date:

Sorry for the delay in posting another version of the patch.

On Tue, Nov 14, 2017 at 11:31 PM, Peter Geoghegan <pg@bowt.ie> wrote:
On Tue, Nov 14, 2017 at 1:41 AM, Rushabh Lathia
<rushabh.lathia@gmail.com> wrote:
> Thanks Peter and Thomas for the review comments.

No problem. More feedback:

* I don't really see much need for this:

+ elog(LOG, "Worker for create index %d", parallel_workers);

You can just use trace_sort, and observe the actual behavior of the
sort that way.


Right, that was just added for testing purposes. Removed in the
latest version of the patch.
 
* As I said before, you should remove the header comments within nbtsort.c.


Done.
 
* This should just say "write routines":

+ * This is why write/recycle routines don't need to know about offsets at
+ * all.


Okay, done.
 
* You didn't point out the randomAccess restriction in tuplesort.h.


I did, it's there in the file header comments.
 
* I can't remember why I added the Valgrind suppression at this point.
I'd remove it until the reason becomes clear, which may never happen.
The regression tests should still pass without Valgrind warnings.


Make sense.
 
* You can add back comments removed from above LogicalTapeTell(). I
made these changes because it looked like we should close out the
possibility of doing a tell during the write phase, as unified tapes
actually would make that hard (no one does what it describes anyway).
But now, unified tapes are a distinct case to frozen tapes in a way
that they weren't before, so there is no need to make it impossible.

I also think you should replace "Assert(lt->frozen)" with
"Assert(lt->offsetBlockNumber == 0L)", for the same reason.


Yep, done.


I see that Robert just committed support for a
parallel_leader_participation GUC. Parallel tuplesort should use this,
too.

It will be easy to adapt the patch to make this work. Just change the
code within nbtsort.c to respect parallel_leader_participation, rather
than leaving that as a compile-time switch. Remove the
force_single_worker variable, and use !parallel_leader_participation
in its place.


Added handling for parallel_leader_participation as well as deleted
compile time option force_single_worker.
 
The parallel_leader_participation docs will also need to be updated.


Done.


Also performed more testing with the patch, with parallel_leader_participation
ON and OFF.  Found one issue: earlier we always called
_bt_leader_sort_as_worker(), but now the call needs to be skipped if
parallel_leader_participation is OFF.

Also fixed the documentation and the compilation error for the documentation.

PFA v14 patch.


...
Thanks,
Rushabh Lathia
Attachment

Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Thu, Dec 7, 2017 at 12:25 AM, Rushabh Lathia
<rushabh.lathia@gmail.com> wrote:
> 0001-Add-parallel-B-tree-index-build-sorting_v14.patch

Cool. I'm glad that we now have a patch that applies cleanly against
master, while adding very little to buffile.c. It feels like we're
getting very close here.

>> * You didn't point out the randomAccess restriction in tuplesort.h.
>>
>
> I did, it's there in the file header comments.

I see what you wrote in tuplesort.h here:

> + * algorithm, and are typically only used for large amounts of data. Note
> + * that parallel sorts is not support for random access to the sort result.

This should say "...are not supported when random access is requested".

> Added handling for parallel_leader_participation as well as deleted
> compile time option force_single_worker.

I still see this:

> +
> +/*
> + * A parallel sort with one worker process, and without any leader-as-worker
> + * state may be used for testing the parallel tuplesort infrastructure.
> + */
> +#ifdef NOT_USED
> +#define FORCE_SINGLE_WORKER
> +#endif

Looks like you missed this FORCE_SINGLE_WORKER hunk -- please remove it, too.

>> The parallel_leader_participation docs will also need to be updated.
>>
>
> Done.

I don't see this. There is no reference to
parallel_leader_participation in the CREATE INDEX docs, nor is there a
reference to CREATE INDEX in the parallel_leader_participation docs.

> Also performed more testing with the patch, with
> parallel_leader_participation
> ON and OFF.  Found one issue: earlier we always called
> _bt_leader_sort_as_worker(), but now the call needs to be skipped if
> parallel_leader_participation
> is OFF.

Hmm. I think the local variable within _bt_heapscan() should go back.
Its value should be taken directly from the parallel_leader_participation
assignment, once. There might be some bizarre circumstances where it
is possible for the value of parallel_leader_participation to change
in flight, causing a race condition: we start with the leader as a
participant, and change our mind later within
_bt_leader_sort_as_worker(), causing the whole CREATE INDEX to hang
forever.

Even if that's impossible, it seems like an improvement in style to go
back to one local variable controlling everything.
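Something like this, say (sketch; names are illustrative):

/* Read the GUC exactly once into a local, and consult only the local
 * from then on, so a mid-build change of the GUC can't bite us. */
bool        leaderparticipates = parallel_leader_participation;

/* ... later, pass leaderparticipates down, and ... */
if (leaderparticipates)
    _bt_leader_sort_as_worker(buildstate);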

Style issue here:

> + long start_block = file->numFiles * BUFFILE_SEG_SIZE;
> + int newNumFiles = file->numFiles + source->numFiles;

Shouldn't start_block conform to the surrounding camelCase style?

Finally, two new thoughts on the patch, that are not responses to
anything you did in v14:

1. Thomas' barrier abstraction was added by commit 1145acc7. I think
that you should use a static barrier in tuplesort.c now, and rip out
the ConditionVariable fields in the Sharedsort struct. It's only a
slightly higher level of abstraction for tuplesort.c, which makes only
a small difference given the simple requirements of tuplesort.c.
However, I see no reason to not go that way if that's the new
standard, which it is. This looks like it will be fairly easy.

2. Does the plan_create_index_workers() cost model need to account for
parallel_leader_participation, too, when capping workers? I think that
it does.

The relevant planner code is:

> +   /*
> +    * Cap workers based on available maintenance_work_mem as needed.
> +    *
> +    * Note that each tuplesort participant receives an even share of the
> +    * total maintenance_work_mem budget.  Aim to leave workers (where
> +    * leader-as-worker Tuplesortstate counts as a worker) with no less than
> +    * 32MB of memory.  This leaves cases where maintenance_work_mem is set to
> +    * 64MB immediately past the threshold of being capable of launching a
> +    * single parallel worker to sort.
> +    */
> +   sort_mem_blocks = (maintenance_work_mem * 1024L) / BLCKSZ;
> +   min_sort_mem_blocks = (32768L * 1024L) / BLCKSZ;
> +   while (parallel_workers > min_parallel_workers &&
> +          sort_mem_blocks / (parallel_workers + 1) < min_sort_mem_blocks)
> +       parallel_workers--;

This parallel CREATE INDEX planner code snippet is about the need to
have low per-worker maintenance_work_mem availability prevent more
parallel workers from being added to the number that we plan to
launch. Each worker tuplesort state needs at least 32MB. We clearly
need to do something here.

While it's always true that "leader-as-worker Tuplesortstate counts as
a worker" in v14, I think that it should only be true in the next
revision of the patch when parallel_leader_participation is actually
true (IOW, we should only add 1 to parallel_workers within the loop
invariant in that case). The reason why we need to consider
parallel_leader_participation within this plan_create_index_workers()
code is simple: During execution, _bt_leader_sort_as_worker() uses
"worker tuplesort states"/btshared->scantuplesortstates to determine
how much of a share of maintenance_work_mem each worker tuplesort
gets. Our planner code needs to take that into account, now that the
nbtsort.c parallel_leader_participation behavior isn't just some
obscure debug option. IOW, the planner code needs to be consistent
with the nbtsort.c execution code.
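Concretely, the loop might become something like this (a sketch of the
suggested change, not final code):

/* Count the leader's Tuplesortstate only when it will participate. */
int         leader_share = parallel_leader_participation ? 1 : 0;

while (parallel_workers > min_parallel_workers &&
       sort_mem_blocks / (parallel_workers + leader_share) < min_sort_mem_blocks)
    parallel_workers--;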

-- 
Peter Geoghegan


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Thomas Munro
Date:
On Fri, Dec 8, 2017 at 1:57 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> 1. Thomas' barrier abstraction was added by commit 1145acc7. I think
> that you should use a static barrier in tuplesort.c now, and rip out
> the ConditionVariable fields in the Sharedsort struct. It's only a
> slightly higher level of abstraction for tuplesort.c, which makes only
> a small difference given the simple requirements of tuplesort.c.
> However, I see no reason to not go that way if that's the new
> standard, which it is. This looks like it will be fairly easy.

I thought about this too.  A static barrier seems ideal for it, except
for one tiny detail.  We'd initialise the barrier with the number of
participants, and then after launching we get to find out how many
workers were really launched using pcxt->nworkers_launched, which may
be a smaller number.  If it's a smaller number, we need to adjust the
barrier to the smaller party size.  We can't do that by calling
BarrierDetach() n times, because Andres convinced me to assert that
you didn't try to detach from a static barrier (entirely reasonably)
and I don't really want a process to be 'detaching' on behalf of
someone else anyway.  So I think we'd need to add an extra barrier
function that lets you change the party size of a static barrier.
Yeah, that sounds like a contradiction...  but it's not the same as
the attach/detach workflow because static parties *start out
attached*, which is a very important distinction (it means that client
code doesn't have to futz about with phases, or in other words the
worker doesn't have to consider the possibility that it started up
late and missed all the action and the sort is finished).  The tidiest
way to provide this new API would, I think, be to change the internal
function BarrierDetachImpl() to take a parameter n and reduce
barrier->participants by that number, and then add a function
BarrierForgetParticipants(barrier, n) [insert better name] and have it
call BarrierDetachImpl().  Then the latter's assertion that
!static_party could move out to BarrierDetach() and
BarrierArriveAndDetach().  Alternatively, we could use the dynamic API
(see earlier parentheses about phases).

The end goal would be that code like this can use
BarrierInit(&barrier, participants), then (if necessary)
BarrierForgetParticipants(&barrier, nonstarters), and then they all
just have to call BarrierArriveAndWait() at the right time and that's
all.  Nice and tidy.

-- 
Thomas Munro
http://www.enterprisedb.com


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Thomas Munro
Date:
On Fri, Dec 8, 2017 at 2:23 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Fri, Dec 8, 2017 at 1:57 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>> 1. Thomas' barrier abstraction was added by commit 1145acc7. I think
>> that you should use a static barrier in tuplesort.c now, and rip out
>> the ConditionVariable fields in the Sharedsort struct.
>
> ... So I think we'd need to add an extra barrier
> function that lets you change the party size of a static barrier.

Something like the attached (untested), which would allow
_bt_begin_parallel() to call BarrierInit(&barrier, request + 1), then
BarrierForgetParticipants(&barrier, request -
pcxt->nworkers_launched), and then all the condition variable loop
stuff can be replaced with a well placed call to
BarrierArriveAndWait(&barrier, WAIT_EVENT_SOMETHING_SOMETHING).
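Putting it together, the flow would look roughly like this (sketch;
BarrierForgetParticipants() is the proposed API from the attached patch,
and the wait event name is a placeholder):

Barrier     barrier;

BarrierInit(&barrier, request + 1);     /* leader + requested workers */
LaunchParallelWorkers(pcxt);
if (pcxt->nworkers_launched < request)  /* party smaller than planned? */
    BarrierForgetParticipants(&barrier, request - pcxt->nworkers_launched);
/* ... leader and workers each sort their share, then ... */
BarrierArriveAndWait(&barrier, WAIT_EVENT_SOMETHING_SOMETHING);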

-- 
Thomas Munro
http://www.enterprisedb.com

Attachment

Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Rushabh Lathia
Date:
Thanks for review.

On Fri, Dec 8, 2017 at 6:27 AM, Peter Geoghegan <pg@bowt.ie> wrote:
On Thu, Dec 7, 2017 at 12:25 AM, Rushabh Lathia
<rushabh.lathia@gmail.com> wrote:
> 0001-Add-parallel-B-tree-index-build-sorting_v14.patch

Cool. I'm glad that we now have a patch that applies cleanly against
master, while adding very little to buffile.c. It feels like we're
getting very close here.

>> * You didn't point out the randomAccess restriction in tuplesort.h.
>>
>
> I did, it's there in the file header comments.

I see what you wrote in tuplesort.h here:

> + * algorithm, and are typically only used for large amounts of data. Note
> + * that parallel sorts is not support for random access to the sort result.

This should say "...are not supported when random access is requested".


Done.
 
> Added handling for parallel_leader_participation as well as deleted
> compile time option force_single_worker.

I still see this:

> +
> +/*
> + * A parallel sort with one worker process, and without any leader-as-worker
> + * state may be used for testing the parallel tuplesort infrastructure.
> + */
> +#ifdef NOT_USED
> +#define FORCE_SINGLE_WORKER
> +#endif

Looks like you missed this FORCE_SINGLE_WORKER hunk -- please remove it, too.


Done.
 
>> The parallel_leader_participation docs will also need to be updated.
>>
>
> Done.

I don't see this. There is no reference to
parallel_leader_participation in the CREATE INDEX docs, nor is there a
reference to CREATE INDEX in the parallel_leader_participation docs.


I thought parallel_leader_participation is a generic GUC which takes effect
for all parallel operations, isn't it?  On that understanding I just updated
the documentation of parallel_leader_participation in config.sgml to
make it more general.

 
> Also performed more testing with the patch, with
> parallel_leader_participation
> ON and OFF.  Found one issue: earlier we always called
> _bt_leader_sort_as_worker(), but now the call needs to be skipped if
> parallel_leader_participation
> is OFF.

Hmm. I think the local variable within _bt_heapscan() should go back.
Its value should be taken directly from the parallel_leader_participation
assignment, once. There might be some bizarre circumstances where it
is possible for the value of parallel_leader_participation to change
in flight, causing a race condition: we start with the leader as a
participant, and change our mind later within
_bt_leader_sort_as_worker(), causing the whole CREATE INDEX to hang
forever.

Even if that's impossible, it seems like an improvement in style to go
back to one local variable controlling everything.


Yes, to me also it looks like an impossible situation, but even so
it makes sense to make one local variable and then always read the
value from that.
 
Style issue here:

> + long start_block = file->numFiles * BUFFILE_SEG_SIZE;
> + int newNumFiles = file->numFiles + source->numFiles;

Shouldn't start_block conform to the surrounding camelCase style?


Done.
 
Finally, two new thoughts on the patch, that are not responses to
anything you did in v14:

1. Thomas' barrier abstraction was added by commit 1145acc7. I think
that you should use a static barrier in tuplesort.c now, and rip out
the ConditionVariable fields in the Sharedsort struct. It's only a
slightly higher level of abstraction for tuplesort.c, which makes only
a small difference given the simple requirements of tuplesort.c.
However, I see no reason to not go that way if that's the new
standard, which it is. This looks like it will be fairly easy.


Pending; as per Thomas' explanation, it seems the barrier APIs need
some more work.

 
2. Does the plan_create_index_workers() cost model need to account for
parallel_leader_participation, too, when capping workers? I think that
it does.

The relevant planner code is:

> +   /*
> +    * Cap workers based on available maintenance_work_mem as needed.
> +    *
> +    * Note that each tuplesort participant receives an even share of the
> +    * total maintenance_work_mem budget.  Aim to leave workers (where
> +    * leader-as-worker Tuplesortstate counts as a worker) with no less than
> +    * 32MB of memory.  This leaves cases where maintenance_work_mem is set to
> +    * 64MB immediately past the threshold of being capable of launching a
> +    * single parallel worker to sort.
> +    */
> +   sort_mem_blocks = (maintenance_work_mem * 1024L) / BLCKSZ;
> +   min_sort_mem_blocks = (32768L * 1024L) / BLCKSZ;
> +   while (parallel_workers > min_parallel_workers &&
> +          sort_mem_blocks / (parallel_workers + 1) < min_sort_mem_blocks)
> +       parallel_workers--;

This parallel CREATE INDEX planner code snippet is about the need to
have low per-worker maintenance_work_mem availability prevent more
parallel workers from being added to the number that we plan to
launch. Each worker tuplesort state needs at least 32MB. We clearly
need to do something here.

While it's always true that "leader-as-worker Tuplesortstate counts as
a worker" in v14, I think that it should only be true in the next
revision of the patch when parallel_leader_participation is actually
true (IOW, we should only add 1 to parallel_workers within the loop
invariant in that case). The reason why we need to consider
parallel_leader_participation within this plan_create_index_workers()
code is simple: During execution, _bt_leader_sort_as_worker() uses
"worker tuplesort states"/btshared->scantuplesortstates to determine
how much of a share of maintenance_work_mem each worker tuplesort
gets. Our planner code needs to take that into account, now that the
nbtsort.c parallel_leader_participation behavior isn't just some
obscure debug option. IOW, the planner code needs to be consistent
with the nbtsort.c execution code.


Ah, nice catch.  I passed the local variable (leaderasworker) of _bt_heapscan()
to plan_create_index_workers() rather than reading the value directly from
parallel_leader_participation (the reasons are the same as you explained earlier).



Thanks,
Rushabh Lathia
Attachment

Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
"Tels"
Date:
Hello Rushabh,

On Fri, December 8, 2017 2:28 am, Rushabh Lathia wrote:
> Thanks for review.
>
> On Fri, Dec 8, 2017 at 6:27 AM, Peter Geoghegan <pg@bowt.ie> wrote:
>
>> On Thu, Dec 7, 2017 at 12:25 AM, Rushabh Lathia
>> <rushabh.lathia@gmail.com> wrote:
>> > 0001-Add-parallel-B-tree-index-build-sorting_v14.patch

I've looked only at patch 0002, here are some comments.

> + * leaderasworker indicates whether leader going to participate as worker or
> + * not.

The grammar is a bit off, and the "or not" seems obvious. IMHO this could be:

+ * leaderasworker indicates whether the leader is going to participate as worker

The argument leaderasworker is only used once and for one temp. variable
that is only used once, too. So the temp. variable could maybe go.

And I'm not sure what the verdict was from the const-discussion threads; I did
not follow them through. If "const" is what should be done generally, then
the argument could be consted, as to not create more "unconsted" code.

E.g. so:

+plan_create_index_workers(Oid tableOid, Oid indexOid, const bool leaderasworker)

and later:

-                   sort_mem_blocks / (parallel_workers + 1) < min_sort_mem_blocks)
+                   sort_mem_blocks / (parallel_workers + (leaderasworker ? 1 : 0)) < min_sort_mem_blocks)

Thank you for working on this patch!

All the best,

Tels




Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Thu, Dec 7, 2017 at 11:28 PM, Rushabh Lathia
<rushabh.lathia@gmail.com> wrote:
> I thought parallel_leader_participation is a generic GUC which takes effect
> for all parallel operations, isn't it?  On that understanding I just updated
> the documentation of parallel_leader_participation in config.sgml to
> make it more general.

Okay. I'm not quite sure how to fit parallel_leader_participation into
parallel CREATE INDEX (see my remarks on that below).

I see a new bug in the patch (my own bug). Which is: the CONCURRENTLY
case should obtain a RowExclusiveLock on the index relation within
_bt_worker_main(), not an AccessExclusiveLock. That's all the leader
has at that point within CREATE INDEX CONCURRENTLY.

I now believe that index_create() should reject catalog parallel
CREATE INDEX directly, just as it does for catalog CREATE INDEX
CONCURRENTLY. That logic should be generic to all AMs, since the
reasons for disallowing catalog parallel index builds are generic.

On a similar note, *maybe* we should even call
plan_create_index_workers() from index_create() (or at least some
point within index.c). You're going to need a new field or two within
IndexInfo for this, beside ii_Concurrent/ii_BrokenHotChain (next to
the other stuff that is only used during index builds). Maybe
ii_ParallelWorkers, and ii_LeaderAsWorker. What do you think of this
suggestion? It's probably neater overall...though I'm less confident
that this one is an improvement.

Note that cluster.c calls plan_cluster_use_sort() directly, while
checking "OldIndex->rd_rel->relam == BTREE_AM_OID" as a prerequisite
to calling it. This seems like it might be considered an example that
we should follow within index.c -- plan_create_index_workers() is
based on plan_cluster_use_sort().

> Yes, to me also it looks like an impossible situation, but even so
> it makes sense to make one local variable and then always read the
> value from that.

I think that it probably is technically possible, though the user
would have to be doing something insane for it to be a problem. As I'm
sure you understand, it's simpler to eliminate the possibility than it
is to reason about it never happening.

>> 1. Thomas' barrier abstraction was added by commit 1145acc7. I think
>> that you should use a static barrier in tuplesort.c now, and rip out
>> the ConditionVariable fields in the Sharedsort struct.

> Pending; as per Thomas' explanation, it seems the barrier APIs need
> some more work.

Okay. It's not the case that parallel tuplesort would significantly
benefit from using the barrier abstraction, so I don't think we need
to consider this a blocker to commit. My concern is mostly just that
everyone is on the same page with barriers.

> Ah, nice catch.  I passed the local variable (leaderasworker) of
> _bt_heapscan()
> to plan_create_index_workers() rather than reading the value directly from
> parallel_leader_participation (the reasons are the same as you explained earlier).

Cool. I don't think that this should be a separate patch -- please
rebase + squash.

Do you think that the main part of the cost model needs to care about
parallel_leader_participation, too?

compute_parallel_worker() assumes that the caller is planning a
parallel-sequential-scan-alike thing, in the sense that the leader
only acts like a worker in cases that probably don't have many
workers, where the leader cannot keep itself busy as a leader. That's
actually quite different to parallel CREATE INDEX, because the
leader-as-worker state will behave in exactly the same way as a worker
would, no matter how many workers there are. The leader process is
guaranteed to give its full attention to being a worker, because it
has precisely nothing else to do until workers finish. This makes me
think that we may need to immediately do something with the result of
compute_parallel_worker(), to consider whether or not a
leader-as-worker state should be used, despite the fact that no
existing compute_parallel_worker() caller does anything like this.

-- 
Peter Geoghegan


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Rushabh Lathia
Date:
Thanks Tels for reviewing the patch.

On Fri, Dec 8, 2017 at 2:54 PM, Tels <nospam-pg-abuse@bloodgate.com> wrote:
Hello Rushabh,

On Fri, December 8, 2017 2:28 am, Rushabh Lathia wrote:
> Thanks for review.
>
> On Fri, Dec 8, 2017 at 6:27 AM, Peter Geoghegan <pg@bowt.ie> wrote:
>
>> On Thu, Dec 7, 2017 at 12:25 AM, Rushabh Lathia
>> <rushabh.lathia@gmail.com> wrote:
>> > 0001-Add-parallel-B-tree-index-build-sorting_v14.patch

I've looked only at patch 0002, here are some comments.

> + * leaderasworker indicates whether leader going to participate as worker or
> + * not.

The grammar is a bit off, and the "or not" seems obvious. IMHO this could be:

+ * leaderasworker indicates whether the leader is going to participate as worker


Sure.
 
The argument leaderasworker is only used once and for one temp. variable
that is only used once, too. So the temp. variable could maybe go.

And I'm not sure what the verdict was from the const-discussion threads; I did
not follow them through. If "const" is what should be done generally, then
the argument could be consted, as to not create more "unconsted" code.

E.g. so:

+plan_create_index_workers(Oid tableOid, Oid indexOid, const bool leaderasworker)


Make sense.
 
and later:

-                   sort_mem_blocks / (parallel_workers + 1) < min_sort_mem_blocks)
+                   sort_mem_blocks / (parallel_workers + (leaderasworker ? 1 : 0)) < min_sort_mem_blocks)


Even I didn't like taking an extra variable, but then the code looked a bit
unreadable - so rather than making the code difficult to read, I thought
of adding the new variable.
 
Thank you for working on this patch!


I will address review comments in the next set of patches.
 


Regards,
Rushabh Lathia

Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Rushabh Lathia
Date:


On Sun, Dec 10, 2017 at 3:06 AM, Peter Geoghegan <pg@bowt.ie> wrote:
On Thu, Dec 7, 2017 at 11:28 PM, Rushabh Lathia
<rushabh.lathia@gmail.com> wrote:
> I thought parallel_leader_participation is a generic GUC which takes effect
> for all parallel operations, isn't it?  On that understanding I just updated
> the documentation of parallel_leader_participation in config.sgml to
> make it more general.

Okay. I'm not quite sure how to fit parallel_leader_participation into
parallel CREATE INDEX (see my remarks on that below).

I see a new bug in the patch (my own bug). Which is: the CONCURRENTLY
case should obtain a RowExclusiveLock on the index relation within
_bt_worker_main(), not an AccessExclusiveLock. That's all the leader
has at that point within CREATE INDEX CONCURRENTLY.


Oh right. I also missed testing that earlier. Fixed now.
 
I now believe that index_create() should reject catalog parallel
CREATE INDEX directly, just as it does for catalog CREATE INDEX
CONCURRENTLY. That logic should be generic to all AMs, since the
reasons for disallowing catalog parallel index builds are generic.


Sorry, I didn't get this - what does reject mean? Do you mean it should throw
an error for catalog parallel CREATE INDEX? Or are you just suggesting to set
ParallelWorkers and maybe LeaderAsWorker from index_create(),
or maybe index_build()?

 
On a similar note, *maybe* we should even call
plan_create_index_workers() from index_create() (or at least some
point within index.c). You're going to need a new field or two within
IndexInfo for this, beside ii_Concurrent/ii_BrokenHotChain (next to
the other stuff that is only used during index builds). Maybe
ii_ParallelWorkers, and ii_LeaderAsWorker. What do you think of this
suggestion? It's probably neater overall...though I'm less confident
that this one is an improvement.

Note that cluster.c calls plan_cluster_use_sort() directly, while
checking "OldIndex->rd_rel->relam == BTREE_AM_OID" as a prerequisite
to calling it. This seems like it might be considered an example that
we should follow within index.c -- plan_create_index_workers() is
based on plan_cluster_use_sort().

> Yes, to me also it looks like an impossible situation, but even so
> it makes sense to make one local variable and then always read the
> value from that.

I think that it probably is technically possible, though the user
would have to be doing something insane for it to be a problem. As I'm
sure you understand, it's simpler to eliminate the possibility than it
is to reason about it never happening.


yes.
 
>> 1. Thomas' barrier abstraction was added by commit 1145acc7. I think
>> that you should use a static barrier in tuplesort.c now, and rip out
>> the ConditionVariable fields in the Sharedsort struct.

> Pending; as per Thomas' explanation, it seems the barrier APIs need
> some more work.

Okay. It's not the case that parallel tuplesort would significantly
benefit from using the barrier abstraction, so I don't think we need
to consider this a blocker to commit. My concern is mostly just that
everyone is on the same page with barriers.


True; if needed, this can also be done later on.
 
> Ah, nice catch.  I passed the local variable (leaderasworker) of
> _bt_heapscan()
> to plan_create_index_workers() rather than reading the value directly from
> parallel_leader_participation (the reasons are the same as you explained earlier).

Cool. I don't think that this should be a separate patch -- please
rebase + squash.


Sure, done.
 
Do you think that the main part of the cost model needs to care about
parallel_leader_participation, too?

compute_parallel_worker() assumes that the caller is planning a
parallel-sequential-scan-alike thing, in the sense that the leader
only acts like a worker in cases that probably don't have many
workers, where the leader cannot keep itself busy as a leader. That's
actually quite different to parallel CREATE INDEX, because the
leader-as-worker state will behave in exactly the same way as a worker
would, no matter how many workers there are. The leader process is
guaranteed to give its full attention to being a worker, because it
has precisely nothing else to do until workers finish. This makes me
think that we may need to immediately do something with the result of
compute_parallel_worker(), to consider whether or not a
leader-as-worker state should be used, despite the fact that no
existing compute_parallel_worker() caller does anything like this.


I agree with you. compute_parallel_worker() is mainly designed for
scan-alike things, whereas parallel CREATE INDEX is different in the
sense that the leader has as much power as a worker.  But at the same
time I don't see any side effect or negative from that with PARALLEL
CREATE INDEX.  So I am more towards not changing that, at least
for now, as part of this patch.
 
Thanks for review.

Regards,
Rushabh Lathia
Attachment

Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Tue, Dec 12, 2017 at 2:09 AM, Rushabh Lathia
<rushabh.lathia@gmail.com> wrote:
>> I now believe that index_create() should reject catalog parallel
>> CREATE INDEX directly, just as it does for catalog CREATE INDEX
>> CONCURRENTLY. That logic should be generic to all AMs, since the
>> reasons for disallowing catalog parallel index builds are generic.
>>
>
> Sorry, I didn't get this - what does reject mean? Do you mean it should throw
> an error for catalog parallel CREATE INDEX? Or are you just suggesting to set
> ParallelWorkers and maybe LeaderAsWorker from index_create(),
> or maybe index_build()?

I mean that we should be careful to make sure that AM-generic parallel
CREATE INDEX logic does not end up in a specific AM (nbtree).

The patch *already* refuses to perform a parallel CREATE INDEX on a
system catalog, which is what I meant by reject (sorry for being
unclear). The point is that that's due to a restriction that has
nothing to do with nbtree in particular (just like the CIC restriction
on catalogs), so it should be performed within index_build(). Just
like the similar CONCURRENTLY-on-a-catalog restriction, though without
throwing an error, since of course the user doesn't explicitly ask for
a parallel CREATE INDEX at any point (unlike CONCURRENTLY).

Once we go this way, the cost model has to be called at that point,
too. We already have the AM-specific "OldIndex->rd_rel->relam ==
BTREE_AM_OID" tests within cluster.c, even though theoretically
another AM might be involved with CLUSTER in the future, which this
seems similar to.

So, I propose the following (this is a rough outline):

* Add new IndexInfo fields after ii_Concurrent/ii_BrokenHotChain --
ii_ParallelWorkers and ii_LeaderAsWorker.

* Call plan_create_index_workers() within index_create(), assigning to
ii_ParallelWorkers, and fill in ii_LeaderAsWorker from the
parallel_leader_participation GUC. Add comments along the lines of
"only nbtree supports parallel builds". Test the index with a
"heapRelation->rd_rel->relam == BTREE_AM_OID" to make this work.
Otherwise, assign zero to ii_ParallelWorkers (and leave
ii_LeaderAsWorker as false).

* For builds on catalogs, or builds using other AMs, don't let
parallelism go ahead by immediately assigning zero to
ii_ParallelWorkers within index_create(), near where the similar CIC
test occurs already.

What do you think of that?
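In IndexInfo terms, the first item amounts to something like this
(sketch; the two new fields are only proposed here):

typedef struct IndexInfo
{
    /* ... existing fields ... */
    bool        ii_Concurrent;       /* existing: CONCURRENTLY build? */
    bool        ii_BrokenHotChain;   /* existing */
    int         ii_ParallelWorkers;  /* proposed: 0 means serial build */
    bool        ii_LeaderAsWorker;   /* proposed: from the GUC, read once */
} IndexInfo;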

>> Do you think that the main part of the cost model needs to care about
>> parallel_leader_participation, too?
>>
>> compute_parallel_worker() assumes that the caller is planning a
>> parallel-sequential-scan-alike thing, in the sense that the leader
>> only acts like a worker in cases that probably don't have many
>> workers, where the leader cannot keep itself busy as a leader. That's
>> actually quite different to parallel CREATE INDEX, because the
>> leader-as-worker state will behave in exactly the same way as a worker
>> would, no matter how many workers there are. The leader process is
>> guaranteed to give its full attention to being a worker, because it
>> has precisely nothing else to do until workers finish. This makes me
>> think that we may need to immediately do something with the result of
>> compute_parallel_worker(), to consider whether or not a
>> leader-as-worker state should be used, despite the fact that no
>> existing compute_parallel_worker() caller does anything like this.
>>
>
> I agree with you. compute_parallel_worker() is mainly designed for
> scan-alike things, whereas parallel CREATE INDEX is different in the
> sense that the leader has as much power as a worker.  But at the same
> time I don't see any side effect or negative from that with PARALLEL
> CREATE INDEX.  So I am more towards not changing that, at least
> for now, as part of this patch.

I've also noticed that there is little to no negative effect on
CREATE INDEX duration from adding new workers past the point where
adding more workers stops making the build faster. It's quite clear.
And, in general, there isn't all that much theoretical justification
for the cost model (it's essentially the same as any other parallel
scan), which doesn't seem to matter much. So, I agree that it doesn't
really matter in practice, but disagree that it should not still be
changed -- the justification may be a little thin, but I think that we
need to stick to it. There should be a theoretical justification for
the cost model that is coherent in the wider context of costs models
for parallelism in general. It should not be arbitrarily inconsistent
just because it apparently doesn't matter that much. It's easy to fix
-- let's just fix it.

-- 
Peter Geoghegan


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Rushabh Lathia
Date:


On Sun, Dec 31, 2017 at 9:59 PM, Peter Geoghegan <pg@bowt.ie> wrote:
On Tue, Dec 12, 2017 at 2:09 AM, Rushabh Lathia
<rushabh.lathia@gmail.com> wrote:
>> I now believe that index_create() should reject catalog parallel
>> CREATE INDEX directly, just as it does for catalog CREATE INDEX
>> CONCURRENTLY. That logic should be generic to all AMs, since the
>> reasons for disallowing catalog parallel index builds are generic.
>>
>
> Sorry, I didn't get this - what does reject mean? Do you mean it should throw
> an error for catalog parallel CREATE INDEX? Or are you just suggesting to set
> ParallelWorkers and maybe LeaderAsWorker from index_create(),
> or maybe index_build()?

I mean that we should be careful to make sure that AM-generic parallel
CREATE INDEX logic does not end up in a specific AM (nbtree).


Ah okay, that's what I thought.
 
The patch *already* refuses to perform a parallel CREATE INDEX on a
system catalog, which is what I meant by reject (sorry for being
unclear). The point is that that's due to a restriction that has
nothing to do with nbtree in particular (just like the CIC restriction
on catalogs), so it should be performed within index_build(). Just
like the similar CONCURRENTLY-on-a-catalog restriction, though without
throwing an error, since of course the user doesn't explicitly ask for
a parallel CREATE INDEX at any point (unlike CONCURRENTLY).

Once we go this way, the cost model has to be called at that point,
too. We already have the AM-specific "OldIndex->rd_rel->relam ==
BTREE_AM_OID" tests within cluster.c, even though theoretically
another AM might be involved with CLUSTER in the future, which this
seems similar to.

So, I propose the following (this is a rough outline):

* Add new IndexInfo fields after ii_Concurrent/ii_BrokenHotChain --
ii_ParallelWorkers and ii_LeaderAsWorker.

* Call plan_create_index_workers() within index_create(), assigning to
ii_ParallelWorkers, and fill in ii_LeaderAsWorker from the
parallel_leader_participation GUC. Add comments along the lines of
"only nbtree supports parallel builds". Test the index with a
"heapRelation->rd_rel->relam == BTREE_AM_OID" to make this work.
Otherwise, assign zero to ii_ParallelWorkers (and leave
ii_LeaderAsWorker as false).

* For builds on catalogs, or builds using other AMs, don't let
parallelism go ahead by immediately assigning zero to
ii_ParallelWorkers within index_create(), near where the similar CIC
test occurs already.

What do you think of that?

This needs to be done after the indexRelation is built, so I added the
call after the update of pg_index, as indexRelation is needed for
plan_create_index_workers().

Attaching a separate patch for the same.
 

>> Do you think that the main part of the cost model needs to care about
>> parallel_leader_participation, too?
>>
>> compute_parallel_worker() assumes that the caller is planning a
>> parallel-sequential-scan-alike thing, in the sense that the leader
>> only acts like a worker in cases that probably don't have many
>> workers, where the leader cannot keep itself busy as a leader. That's
>> actually quite different to parallel CREATE INDEX, because the
>> leader-as-worker state will behave in exactly the same way as a worker
>> would, no matter how many workers there are. The leader process is
>> guaranteed to give its full attention to being a worker, because it
>> has precisely nothing else to do until workers finish. This makes me
>> think that we may need to immediately do something with the result of
>> compute_parallel_worker(), to consider whether or not a
>> leader-as-worker state should be used, despite the fact that no
>> existing compute_parallel_worker() caller does anything like this.
>>
>
> I agree with you. compute_parallel_worker() is mainly designed for
> scan-alike things, whereas parallel create index is different in the
> sense that the leader has as much power as a worker.  But at the same
> time I don't see any side effect or negative of that with PARALLEL
> CREATE INDEX.  So I am more towards not changing that, at least
> for now, as part of this patch.

I've also noticed that there is little to no negative effect on
CREATE INDEX duration from adding new workers past the point where
additional workers stop making the build faster. It's quite clear.
And, in general, there isn't all that much theoretical justification
for the cost model (it's essentially the same as any other parallel
scan), which doesn't seem to matter much. So, I agree that it doesn't
really matter in practice, but disagree that it should be left
unchanged -- the justification may be a little thin, but I think that we
need to stick to it. There should be a theoretical justification for
the cost model that is coherent in the wider context of cost models
for parallelism in general. It should not be arbitrarily inconsistent
just because it apparently doesn't matter that much. It's easy to fix
-- let's just fix it.

So you're suggesting that we need to adjust the output of
compute_parallel_worker() by considering parallel_leader_participation?


Thanks,
Rushabh Lathia

Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Tue, Jan 2, 2018 at 1:38 AM, Rushabh Lathia <rushabh.lathia@gmail.com> wrote:
> This needs to be done after the indexRelation is built, so I added the
> call after the update of pg_index, as indexRelation is needed for
> plan_create_index_workers().
>
> Attaching a separate patch for the same.

This made it so that REINDEX and CREATE INDEX CONCURRENTLY no longer
used parallelism. I think we need to do this very late, just before
nbtree's ambuild() routine is called from index.c.
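
Concretely, I'm thinking of something like this (sketch only; the
point is that CREATE INDEX, REINDEX, and CREATE INDEX CONCURRENTLY all
pass through index_build()):

/* In index_build(), immediately before invoking the access method */
indexInfo->ii_ParallelWorkers =
    plan_create_index_workers(RelationGetRelid(heapRelation),
                              RelationGetRelid(indexRelation));

stats = indexRelation->rd_amroutine->ambuild(heapRelation, indexRelation,
                                             indexInfo);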

> So you're suggesting that we need to adjust the output of
> compute_parallel_worker() by considering parallel_leader_participation?

We know for sure that there is no reason to not use the leader process
as a worker process in the case of parallel CREATE INDEX. So we must
not have the number of participants (i.e. worker Tuplesortstates) vary
based on the current parallel_leader_participation setting. While
parallel_leader_participation can affect the number of worker
processes requested, that's a different thing. There is no question
about parallel_leader_participation ever being relevant to performance
-- it's strictly a testing option for us.

Even after parallel_leader_participation was added,
compute_parallel_worker() still assumes that the sequential scan
leader is always too busy to help. compute_parallel_worker() seems to
think that that's something that the leader does in "rare" cases not
worth considering -- cases where it has no worker tuples to consume
(maybe I'm reading too much into it not caring about
parallel_leader_participation, but I don't think so). If
compute_parallel_worker()'s assumption was questionable before, it's
completely wrong for parallel CREATE INDEX. I think
plan_create_index_workers() needs to count the leader-as-worker as an
ordinary worker, not special in any way by deducting one worker from
what compute_parallel_worker() returns. (This only happens when it's
necessary to compensate -- when leader-as-worker participation is
going to go ahead.)

I'm working on fixing up what you posted. I'm probably not more than a
week away from posting a patch that I'm going to mark "ready for
committer". I've already made the change above, and once I spend time
on trying to break the few small changes needed within buffile.c I'll
have taken it as far as I can, most likely.

-- 
Peter Geoghegan


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Rushabh Lathia
Date:


On Wed, Jan 3, 2018 at 9:11 AM, Peter Geoghegan <pg@bowt.ie> wrote:
On Tue, Jan 2, 2018 at 1:38 AM, Rushabh Lathia <rushabh.lathia@gmail.com> wrote:
> This needs to be done after the indexRelation is built, so I added the
> call after the update of pg_index, as indexRelation is needed for
> plan_create_index_workers().
>
> Attaching a separate patch for the same.

This made it so that REINDEX and CREATE INDEX CONCURRENTLY no longer
used parallelism. I think we need to do this very late, just before
nbtree's ambuild() routine is called from index.c.


Ahh right.  We should move the plan_create_index_workers() call to
index_build(), before the call to ambuild().

 
> So you're suggesting that we need to adjust the output of
> compute_parallel_worker() by considering parallel_leader_participation?

We know for sure that there is no reason to not use the leader process
as a worker process in the case of parallel CREATE INDEX. So we must
not have the number of participants (i.e. worker Tuplesortstates) vary
based on the current parallel_leader_participation setting. While
parallel_leader_participation can affect the number of worker
processes requested, that's a different thing. There is no question
about parallel_leader_participation ever being relevant to performance
-- it's strictly a testing option for us.

Even after parallel_leader_participation was added,
compute_parallel_worker() still assumes that the sequential scan
leader is always too busy to help. compute_parallel_worker() seems to
think that that's something that the leader does in "rare" cases not
worth considering -- cases where it has no worker tuples to consume
(maybe I'm reading too much into it not caring about
parallel_leader_participation, but I don't think so). If
compute_parallel_worker()'s assumption was questionable before, it's
completely wrong for parallel CREATE INDEX. I think
plan_create_index_workers() needs to count the leader-as-worker as an
ordinary worker, not special in any way by deducting one worker from
what compute_parallel_worker() returns. (This only happens when it's
necessary to compensate -- when leader-as-worker participation is
going to go ahead.)


Yes, even with parallel_leader_participation, compute_parallel_worker()
doesn't take that into consideration.  Or maybe the assumption is that we
launch the number of workers returned by compute_parallel_worker(),
irrespective of whether the leader is going to participate in the scan or not.

I agree that plan_create_index_workers() needs to count the leader as a
normal worker for CREATE INDEX.  So what you're proposing is: when
parallel_leader_participation is true, launch (return value of
compute_parallel_worker() - 1) workers.  True?

I'm working on fixing up what you posted. I'm probably not more than a
week away from posting a patch that I'm going to mark "ready for
committer". I've already made the change above, and once I spend time
on trying to break the few small changes needed within buffile.c I'll
have taken it as far as I can, most likely.


Okay, once you submit the patch with the changes, I will do one round
of review.

Thanks,
Rushabh Lathia

Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Tue, Jan 2, 2018 at 8:43 PM, Rushabh Lathia <rushabh.lathia@gmail.com> wrote:
> I agree that plan_create_index_workers() needs to count the leader as a
> normal worker for CREATE INDEX.  So what you're proposing is: when
> parallel_leader_participation is true, launch (return value of
> compute_parallel_worker() - 1) workers.  True?

Almost. We need to not subtract one when only one worker is indicated
by compute_parallel_worker(). I also added some new stuff there, to
consider edge cases with the parallel_leader_participation GUC.
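
Roughly speaking, what I have now looks like this (sketch only --
names are illustrative, and the worker cap argument is something that
the patch itself adds to compute_parallel_worker()):

parallel_workers = compute_parallel_worker(rel, heap_blocks, -1,
                                           max_parallel_workers_maintenance);

/*
 * The leader-as-worker participant does just as much work as a launched
 * worker process, so deduct one worker to compensate -- but never when
 * doing so would leave us with no worker processes at all.
 */
if (leaderasworker && parallel_workers > 1)
    parallel_workers--;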

>> I'm working on fixing up what you posted. I'm probably not more than a
>> week away from posting a patch that I'm going to mark "ready for
>> committer". I've already made the change above, and once I spend time
>> on trying to break the few small changes needed within buffile.c I'll
>> have taken it as far as I can, most likely.
>>
>
> Okay, once you submit the patch with the changes, I will do one round
> of review.

I've attached my revision. Changes include:

* Changes to plan_create_index_workers() were made along the lines
recently discussed.

* plan_create_index_workers() is now called right before the ambuild
routine is called (nbtree index builds only, of course).

* Significant overhaul of tuplesort.h contract. This had references to
the old approach, and to tqueue.c's tuple descriptor thing that was
since superseded by the typmod registry added for parallel hash join.
These were updated/removed.

* Both tuplesort.c and logtape.c now say that they cannot write to the
writable/last tape, while still acknowledging that it is in fact the
leader tape, and that this restriction is due to a restriction with
BufFiles. They also point out that if the restriction within buffile.c
ever was removed, everything would work fine.

* Added new call to BufFileExportShared() when freezing tape in logtape.c.

* Tweaks to documentation.

* pgindent ran on modified files.

* Polished the stuff that is added to buffile.c. Mostly comments that
clarify its reason for existing. Also added Assert()s.

Note that I added Heikki as an author in the commit message.
Technically, Heikki didn't actually write code for parallel CREATE
INDEX, but he did loads of independently useful work on merging + temp
file I/O that went into Postgres 10 (though this wasn't listed in the
v10 release notes). That work was done in large part to help the
parallel CREATE INDEX patch, and it did in fact help it quite
noticeably, so I think that this is warranted. Remember that with
parallel CREATE INDEX, the leader's merge occurs serially, so anything
that we can do to speed that part up is very helpful.

This revision does seem very close, but I'll hold off on changing the
status of the patch for a few more days, to give you time to give some
feedback.

-- 
Peter Geoghegan


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Rushabh Lathia
Date:


On Sat, Jan 6, 2018 at 3:47 AM, Peter Geoghegan <pg@bowt.ie> wrote:
On Tue, Jan 2, 2018 at 8:43 PM, Rushabh Lathia <rushabh.lathia@gmail.com> wrote:
> I agree that plan_create_index_workers() needs to count the leader as a
> normal worker for CREATE INDEX.  So what you're proposing is: when
> parallel_leader_participation is true, launch (return value of
> compute_parallel_worker() - 1) workers.  True?

Almost. We need to not subtract one when only one worker is indicated
by compute_parallel_worker(). I also added some new stuff there, to
consider edge cases with the parallel_leader_participation GUC.

>> I'm working on fixing up what you posted. I'm probably not more than a
>> week away from posting a patch that I'm going to mark "ready for
>> committer". I've already made the change above, and once I spend time
>> on trying to break the few small changes needed within buffile.c I'll
>> have taken it as far as I can, most likely.
>>
>
> Okay, once you submit the patch with the changes, I will do one round
> of review.

I've attached my revision. Changes include:

* Changes to plan_create_index_workers() were made along the lines
recently discussed.

* plan_create_index_workers() is now called right before the ambuild
routine is called (nbtree index builds only, of course).

* Significant overhaul of tuplesort.h contract. This had references to
the old approach, and to tqueue.c's tuple descriptor thing that was
since superseded by the typmod registry added for parallel hash join.
These were updated/removed.

* Both tuplesort.c and logtape.c now say that they cannot write to the
writable/last tape, while still acknowledging that it is in fact the
leader tape, and that this restriction is due to a restriction with
BufFiles. They also point out that if the restriction within buffile.c
ever was removed, everything would work fine.

* Added new call to BufFileExportShared() when freezing tape in logtape.c.

* Tweaks to documentation.

* pgindent ran on modified files.

* Polished the stuff that is added to buffile.c. Mostly comments that
clarify its reason for existing. Also added Assert()s.

Note that I added Heikki as an author in the commit message.
Technically, Heikki didn't actually write code for parallel CREATE
INDEX, but he did loads of independently useful work on merging + temp
file I/O that went into Postgres 10 (though this wasn't listed in the
v10 release notes). That work was done in large part to help the
parallel CREATE INDEX patch, and it did in fact help it quite
noticeably, so I think that this is warranted. Remember that with
parallel CREATE INDEX, the leader's merge occurs serially, so anything
that we can do to speed that part up is very helpful.

This revision does seem very close, but I'll hold off on changing the
status of the patch for a few more days, to give you time to give some
feedback.


Thanks, Peter, for the updated patch.

I went through the changes and performed some basic testing. The changes
look good, and I haven't found anything unusual during testing.

Thanks,
Rushabh Lathia

Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Mon, Jan 8, 2018 at 9:44 PM, Rushabh Lathia <rushabh.lathia@gmail.com> wrote:
> I went through the changes and performed some basic testing. The changes
> look good, and I haven't found anything unusual during testing.

Then I'll mark the patch "Ready for Committer" now. I think that we've
done just about all we can with it.

There is one lingering concern that I cannot shake, which stems from
the fact that the cost model (plan_create_index_workers()) follows the
same generic logic for adding workers as parallel sequential scan, per
Robert's feedback from around March of last year (that is, we more or
less just reuse compute_parallel_worker()). My specific concern is
that this approach may be too aggressive in situations where a
parallel external sort ends up being used instead of a serial internal
sort. No weight is given to any extra temp file costs; a serial
external sort is, in a sense, the baseline, including in cases where
the table is very small and an external sort can actually easily be
avoided iff we do a serial sort.

This is probably not worth doing anything about. The distinction
between internal and external sorts became rather blurred in 9.6 and
10, which, in a way, this patch builds on. If what I describe is a
problem at all, it will very probably only be a problem on small
CREATE INDEX operations, where linear sequential I/O costs are not
already dwarfed by the linearithmic CPU costs. (The dominance of
CPU/comparison costs on larger sorts is the main reason why external
sorts can be faster than internal sorts -- this happens fairly
frequently these days, especially with CREATE INDEX, where being able
to write out the index as it merges on-the-fly helps a lot.)

-- 
Peter Geoghegan


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Thomas Munro
Date:
On Sat, Jan 6, 2018 at 11:17 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> * Significant overhaul of tuplesort.h contract. This had references to
> the old approach, and to tqueue.c's tuple descriptor thing that was
> since superseded by the typmod registry added for parallel hash join.
> These were updated/removed.

+1

> * Both tuplesort.c and logtape.c now say that they cannot write to the
> writable/last tape, while still acknowledging that it is in fact the
> leader tape, and that this restriction is due to a restriction with
> BufFiles. They also point out that if the restriction within buffile.c
> ever was removed, everything would work fine.

+1

> * Added new call to BufFileExportShared() when freezing tape in logtape.c.

+1

> * Polished the stuff that is added to buffile.c. Mostly comments that
> clarify its reason for existing. Also added Assert()s.

+1

This looks good to me.

-- 
Thomas Munro
http://www.enterprisedb.com


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Robert Haas
Date:
On Tue, Jan 9, 2018 at 10:36 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> This looks good to me.

The addition to README.parallel is basically wrong, because workers
have been allowed to write WAL since the ParallelContext machinery.
See the XactLastRecEnd handling in parallel.c.  Workers can, for
example, do HOT cleanups during SELECT scans, just as the leader can.
The language here is obsolete anyway in light of commit
e9baa5e9fa147e00a2466ab2c40eb99c8a700824, but this isn't the right way
to update it.  I'll propose a separate patch for that.

The change to the ParallelContext signature in parallel.h makes an
already-overlength line even longer.  A line break seems warranted
just after the first argument, plus pgindent afterward.

I am not a fan of the leader-as-worker terminology.  The leader is not
a worker, full stop.  I think we should instead talk about whether the
leader participates (so, ii_LeaderAsWorker -> ii_LeaderParticipates,
for example, plus many comment updates).  Similarly, it seems
SortCoordinateData's nLaunched should be nParticipants, and BTLeader's
nworkertuplesorts should be nparticipanttuplesorts.

There is also the question of whether we want to respect
parallel_leader_participation in this context.  The issues which might
motivate the desire for such behavior in the context of a query do not
exist when creating a btree index, so maybe we're just making this
complicated. On the other hand, if some other type of parallel index
build does end up doing a Gather-like operation then we might regret
deciding that parallel_leader_participation doesn't apply to index
builds, so maybe it's OK the way we have it.  On the third hand, the
complexity of having the leader maybe-participate seems like it
extends to a fair number of places in the code, and getting rid of all
that complexity seems appealing.

One place where this actually causes a problem is the message changes
to index_build().  The revised ereport() violates translatability
guidelines, which require that messages not be assembled from pieces.
See https://www.postgresql.org/docs/devel/static/nls-programmer.html#NLS-GUIDELINES
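
To illustrate the guideline (hypothetical code, not taken from the
patch):

/*
 * Untranslatable: the message is assembled from pieces, so translators
 * only ever see fragments.
 */
ereport(DEBUG1,
        (errmsg("%s index \"%s\"",
                parallel ? "building in parallel" : "building serially",
                indexname)));

/* Translatable: each complete message appears as a whole string. */
if (parallel)
    ereport(DEBUG1,
            (errmsg("building index \"%s\" in parallel", indexname)));
else
    ereport(DEBUG1,
            (errmsg("building index \"%s\" serially", indexname)));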

A comment added to tuplesort.h says that work_mem should be at least
64KB, but does not give any reason.  I think one should be given, at
least briefly, so that someone looking at these comments in the future
can, for example, figure out whether the comment is still correct
after future code changes.  Or else, remove the comment.

+ * Parallel sort callers are required to coordinate multiple tuplesort states
+ * in a leader process, and one or more worker processes.  The leader process

I think the comma should be removed.  As written, it looks like we
are coordinating multiple tuplesort states in a leader process, and,
separately, we are coordinating one or more worker processes.  But in
fact we are coordinating multiple tuplesort states which are in a
group of processes that includes the leader and one or more worker
processes.

Generally, I think the comments in tuplesort.h are excellent.  I
really like the overview of how the new interfaces should be used,
although I find it slightly wonky that the leader needs two separate
Tuplesortstates if it wants to participate.

I don't understand why this patch needs to tinker with the tests in
vacuum.sql.  The comments say that "If we did not do this, errors
raised would concern running ANALYZE in parallel mode."  However, why
should parallel CREATE INDEX have any impact on ANALYZE at all?
Also, as a practical matter, if I revert those changes, 'make check'
still passes with or without force_parallel_mode=on.

I really dislike the fact that this patch invents another thing for
force_parallel_mode to do.  I invented force_parallel_mode mostly as a
way of testing that functions were correctly labeled for
parallel-safety, and I think it would be just fine if it never does
anything else.  As it is, it already does two quite separate things to
accomplish that goal: (1) forcibly run the whole plan with parallel
mode restrictions enabled, provided that the plan is not
parallel-unsafe, and (2) runs the plan in a worker, provided that the
plan is parallel-safe.  There's a subtle difference between those two
conditions, which is that not parallel-unsafe does not equal
parallel-safe; there is also parallel-restricted.  The fact that
force_parallel_mode controls two different behaviors has, I think,
already caused some confusion for prominent PostgreSQL developers and,
likely, users as well.  Making it do a third thing seems to me to be
adding to the confusion, and not only because there are no
documentation changes to match.  If we go down this road, there will
probably be more additions -- what happens when parallel VACUUM
arrives, or parallel CLUSTER, or whatever?  I don't think it will be a
good thing for PostgreSQL if we end up with force_parallel_mode=on as
a general "use parallelism even though it's stupid" flag, requiring
supporting code in many different places throughout the code base and
a laundry list of not-actually-useful behavior changes in the
documentation.

What I think would be a lot more useful, and what I sort of expected
the patch to have, is a way for a user to explicitly control the
number of workers requested for a CREATE INDEX operation.  We all know
that the cost model is crude and that may be OK -- though it would be
interesting to see some research on what the run times actually look
like for various numbers of workers at various table sizes and
work_mem settings -- but it will be inconvenient for DBAs who actually
know what number of workers they want to use to instead get whatever
value plan_create_index_workers() decides to emit.  They can force it
by setting the parallel_workers reloption, but that affects queries.
They can probably also do it by setting min_parallel_table_scan_size =
0 and max_parallel_workers_maintenance to whatever value they want,
but I think it would be convenient for there to be a more
straightforward way to do it, or at least some documentation in the
CREATE INDEX page about how to get the number of workers you really
want.  To be clear, I don't think that this is a must-fix issue for
this patch to get committed, but I do think that all reference to
force_parallel_mode=on should go away.

I do not like the way that this patch wants to turn the section of the
documentation on when parallel query can be used into a discussion of
when parallelism can be used.  I think it would be better to leave
that section alone and instead document under CREATE INDEX the
concerns specific to parallel index build. I think this will be easier
for users to understand and far easier to maintain as the number of
parallel DDL operations increases, which I expect it to do somewhat
explosively.  The patch as written says things like "If a utility
statement that is expected to do so does not produce a parallel plan,
..."  but, one, utility statements *do not produce plans of any type*
and, two, the concerns here are really specific to parallel CREATE
INDEX and there is every reason to think that they might be different
in other cases.  I feel strongly that it's enough for this section to
try to explain the concerns that pertain to optimizable queries and
leave utility commands to be treated elsewhere.  If we find that we're
accumulating a lot of documentation for various parallel utility
commands that seems to be duplicative, we can write a general
treatment of that topic that is separate from this one.

The documentation for max_parallel_workers_maintenance cribs from the
documentation for max_parallel_workers_per_gather in saying that we'll
use fewer workers than expected "which may be inefficient".  However,
for parallel CREATE INDEX, that trailing clause is, at least as far as
I can see, not applicable.  For a query, we might choose a Gather over
a Parallel Seq Scan because we think we've got a lot of workers; with
only one participant, we might prefer a GIN index scan.  If it turns
out we don't get the workers, we've got a clearly suboptimal plan.
For CREATE INDEX, though, it seems to me that we don't make any
decisions based on the number of workers we think we'll have.  If we
get fewer workers, it may be slower, but it should still be as fast
as it can be with that number of workers, which for queries is not the
case.

+     * These fields are not modified throughout the sort.  They primarily
+     * exist for the benefit of worker processes, that need to create BTSpool
+     * state corresponding to that used by the leader.

throughout -> during

remove comma

+     * builds, that must work just the same when an index is built in

remove comma

+     * State that is aggregated by workers, to report back to leader.

State that is maintained by workers and reported back to leader.

+     * indtuples is the total number of tuples that made it into index.

into the index

+     * btleader is only present when a parallel index build is performed, and
+     * only in leader process (actually, only the leader has a BTBuildState.
+     * Workers have their own spool and spool2, though.)

the leader process
period after "process"
capitalize actually

+     * Done.  Leave a way for leader to determine we're finished.  Record how
+     * many tuples were in this worker's share of the relation.

I don't understand what the "Leave a way" comment means.

+ * To support parallel sort operations involving coordinated callers to
+ * tuplesort.c routines across multiple workers, it is necessary to
+ * concatenate each worker BufFile/tapeset into one single leader-wise
+ * logical tapeset.  Workers should have produced one final materialized
+ * tape (their entire output) when this happens in leader; there will always
+ * be the same number of runs as input tapes, and the same number of input
+ * tapes as workers.

I can't interpret the word "leader-wise".  A partition-wise join is a
join done one partition at a time, but a leader-wise logical tape set
is not done one leader at a time.  If there's another meaning to the
affix -wise, I'm not familiar with it.  Don't we just mean "a single
logical tapeset managed by the leader"?

There's a lot here I haven't grokked yet, but I'm running out of
mental energy so I think I'll send this for now and work on this some
more when time permits, hopefully tomorrow.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Evgeniy Shishkin
Date:

> On Jan 10, 2018, at 21:45, Robert Haas <robertmhaas@gmail.com> wrote:
> 
> The documentation for max_parallel_workers_maintenance cribs from the
> documentation for max_parallel_workers_per_gather in saying that we'll
> use fewer workers than expected "which may be inefficient". 

Can we actually call it max_parallel_maintenance_workers instead?
I mean we don't have work_mem_maintenance.



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Robert Haas
Date:
On Wed, Jan 10, 2018 at 3:29 PM, Evgeniy Shishkin <itparanoia@gmail.com> wrote:
>> On Jan 10, 2018, at 21:45, Robert Haas <robertmhaas@gmail.com> wrote:
>> The documentation for max_parallel_workers_maintenance cribs from the
>> documentation for max_parallel_workers_per_gather in saying that we'll
>> use fewer workers than expected "which may be inefficient".
>
> Can we actually call it max_parallel_maintenance_workers instead?
> I mean we don't have work_mem_maintenance.

Good point.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Wed, Jan 10, 2018 at 10:45 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> The addition to README.parallel is basically wrong, because workers
> have been allowed to write WAL since the ParallelContext machinery.
> See the XactLastRecEnd handling in parallel.c.  Workers can, for
> example, do HOT cleanups during SELECT scans, just as the leader can.
> The language here is obsolete anyway in light of commit
> e9baa5e9fa147e00a2466ab2c40eb99c8a700824, but this isn't the right way
> to update it.  I'll propose a separate patch for that.

WFM.

> The change to the ParallelContext signature in parallel.h makes an
> already-overlength line even longer.  A line break seems warranted
> just after the first argument, plus pgindent afterward.

Okay.

> I am not a fan of the leader-as-worker terminology.  The leader is not
> a worker, full stop.  I think we should instead talk about whether the
> leader participates (so, ii_LeaderAsWorker -> ii_LeaderParticipates,
> for example, plus many comment updates).  Similarly, it seems
> SortCoordinateData's nLaunched should be nParticipants, and BTLeader's
> nworkertuplesorts should be nparticipanttuplesorts.

Okay.

> There is also the question of whether we want to respect
> parallel_leader_participation in this context.  The issues which might
> motivate the desire for such behavior in the context of a query do not
> exist when creating a btree index, so maybe we're just making this
> complicated. On the other hand, if some other type of parallel index
> build does end up doing a Gather-like operation then we might regret
> deciding that parallel_leader_participation doesn't apply to index
> builds, so maybe it's OK the way we have it.  On the third hand, the
> complexity of having the leader maybe-participate seems like it
> extends to a fair number of places in the code, and getting rid of all
> that complexity seems appealing.

I only added support for the leader-as-worker case because I assumed
that it was important to have CREATE INDEX process allocation work
"analogously" to parallel query, even though it's clear that the two
situations are not really completely comparable when you dig deep
enough. Getting rid of the leader participating as a worker has
theoretical downsides, but real practical upsides. I am also tempted
to just get rid of it.

> One place where this actually causes a problem is the message changes
> to index_build().  The revised ereport() violates translatability
> guidelines, which require that messages not be assembled from pieces.
> See https://www.postgresql.org/docs/devel/static/nls-programmer.html#NLS-GUIDELINES

Noted. Another place where a worker Tuplesortstate in the leader
process causes problems is plan_create_index_workers(), especially
because of things like force_parallel_mode and
parallel_leader_participation.

> A comment added to tuplesort.h says that work_mem should be at least
> 64KB, but does not give any reason.  I think one should be given, at
> least briefly, so that someone looking at these comments in the future
> can, for example, figure out whether the comment is still correct
> after future code changes.  Or else, remove the comment.

The reason for needing to do this is that a naive division of
work_mem/maintenance_work_mem within a caller like nbtsort.c could,
in general, result in a workMem that is as low as 0 (due to integer
truncation of the result of a division). Clearly *that* is too low. In
fact, we need at least enough memory to store the initial minimal
memtuples array, which needs to respect ALLOCSET_SEPARATE_THRESHOLD.
There is also the matter of having per-tape space for
TAPE_BUFFER_OVERHEAD when we spill to disk (note also the special case
for pass-by-value datum sorts low on memory). There have been a couple
of unavoidable OOM bugs in tuplesort over the years already.

How about I remove the comment, but have tuplesort_begin_common()
force each Tuplesortstate to have workMem that is at least 64KB
(minimum legal work_mem value) in all cases? We can just formalize the
existing assumption that workMem cannot go below 64KB, really, and it
isn't reasonable to use so little workMem within a parallel worker (it
should be prevented by plan_create_index_workers() in the real world,
where parallelism is never artificially forced).

There is no need to make this complicated by worrying about whether or
not 64KB is the true minimum (value that avoids "can't happen"
errors), IMV.
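
Concretely, the clamp could be a one-liner at the top of
tuplesort_begin_common() (sketch):

/* Formalize 64KB as the floor, regardless of what the caller passed */
state->allowedMem = Max(workMem, 64) * (int64) 1024;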

> + * Parallel sort callers are required to coordinate multiple tuplesort states
> + * in a leader process, and one or more worker processes.  The leader process
>
> I think the comma should be removed.  As written, it looks like we
> are coordinating multiple tuplesort states in a leader process, and,
> separately, we are coordinating one or more worker processes.

Okay.

> Generally, I think the comments in tuplesort.h are excellent.

Thanks.

> I really like the overview of how the new interfaces should be used,
> although I find it slightly wonky that the leader needs two separate
> Tuplesortstates if it wants to participate.

Assuming that we end up actually allowing the leader to participate as
a worker at all, then I think that having that be a separate
Tuplesortstate is better than the alternative. There are a couple of
places where I can see it mattering. For one thing, dtrace compatible
traces become more complicated -- LogicalTapeSetBlocks() is reported
to dtrace within workers (though not via trace_sort logging, where it
is considered redundant). For another, I think we'd need to have
multiple tapesets at the same time for the leader if it only had one
Tuplesortstate, which means multiple new Tuplesortstate fields.

In short, having a distinct Tuplesortstate means almost no special
cases. Maybe you find it slightly wonky because parallel CREATE INDEX
really does have the leader participate as a worker with minimal
caveats. It will do just as much work as a real parallel worker
process, which really is quite a new thing, in a way.

> I don't understand why this patch needs to tinker with the tests in
> vacuum.sql.  The comments say that "If we did not do this, errors
> raised would concern running ANALYZE in parallel mode."  However, why
> should parallel CREATE INDEX have any impact on ANALYZE at all?
> Also, as a practical matter, if I revert those changes, 'make check'
> still passes with or without force_parallel_mode=on.

This certainly wasn't true before now -- parallel CREATE INDEX could
previously cause the test to give different output for one error
message. I'll revert that change.

I imagine (though haven't verified) that this happened because, as you
pointed out separately, I didn't get the memo about e9baa5e9 (this is
the commit you mentioned in relation to README.parallel/parallel write
DML).

> I really dislike the fact that this patch invents another thing for
> force_parallel_mode to do.  I invented force_parallel_mode mostly as a
> way of testing that functions were correctly labeled for
> parallel-safety, and I think it would be just fine if it never does
> anything else.

This is not something that I feel strongly about, though I think it is
useful to test parallel CREATE INDEX in low memory conditions, one way
or another.

> I don't think it will be a
> good thing for PostgreSQL if we end up with force_parallel_mode=on as
> a general "use parallelism even though it's stupid" flag, requiring
> supporting code in many different places throughout the code base and
> a laundry list of not-actually-useful behavior changes in the
> documentation.

I will admit that "use parallelism even though it's stupid" is how I
thought of force_parallel_mode=on. I thought of it as a testing option
that users shouldn't need to concern themselves with in almost all
cases. I am not at all attached to what I did with
force_parallel_mode, except that it provides some way to test low
memory conditions, and it was something that I thought you'd expect
from this patch.

> What I think would be a lot more useful, and what I sort of expected
> the patch to have, is a way for a user to explicitly control the
> number of workers requested for a CREATE INDEX operation.

I tend to agree. It wouldn't be *very* compelling, because performance
doesn't seem to be that sensitive to how many workers are used anyway,
but it's worth having.

> We all know
> that the cost model is crude and that may be OK -- though it would be
> interesting to see some research on what the run times actually look
> like for various numbers of workers at various table sizes and
> work_mem settings -- but it will be inconvenient for DBAs who actually
> know what number of workers they want to use to instead get whatever
> value plan_create_index_workers() decides to emit.

I did a lot of unpublished research on this over a year ago, and
noticed nothing strange then. I guess I could revisit it using the box
that Postgres Pro provided me with access to.

> They can force it
> by setting the parallel_workers reloption, but that affects queries.
> They can probably also do it by setting min_parallel_table_scan_size =
> 0 and max_parallel_workers_maintenance to whatever value they want,
> but I think it would be convenient for there to be a more
> straightforward way to do it, or at least some documentation in the
> CREATE INDEX page about how to get the number of workers you really
> want.  To be clear, I don't think that this is a must-fix issue for
> this patch to get committed, but I do think that all reference to
> force_parallel_mode=on should go away.

The only reason I didn't add a "just use this many parallel workers"
option myself already is that doing so introduces awkward ambiguities.
Long ago, there was a parallel_workers index storage param added by
the patch, which you didn't like because it confused the issue in just
the same way as the table parallel_workers storage param does now,
would have confused parallel index scan, and so on. I counter-argued
that though this was ugly, it seemed to be how it worked on other
systems (more of an explanation than an argument, actually, because I
find it hard to know what to do here).

You're right that there should be a way to simply force the number of
parallel workers for DDL commands that use parallelism. You're also
right to be concerned about that not being a storage parameter (index
or otherwise), because that modifies run time behavior in a surprising
way (even if this pitfall *is* actually something that users of SQL
Server and Oracle have to live with).  Adding something to the CREATE
INDEX grammar just for this *also* seems confusing, because users will
think that it is a storage parameter even though it isn't (I'm pretty
sure that almost no Postgres user can give you a definition of a
storage parameter without some prompting).

I share your general feelings on all of this, but I really don't know
what to do about it. Which of these alternatives is the least worst,
all things considered?

> I do not like the way that this patch wants to turn the section of the
> documentation on when parallel query can be used into a discussion of
> when parallelism can be used.  I think it would be better to leave
> that section alone and instead document under CREATE INDEX the
> concerns specific to parallel index build. I think this will be easier
> for users to understand and far easier to maintain as the number of
> parallel DDL operations increases, which I expect it to do somewhat
> explosively.

WFM.

> The documentation for max_parallel_workers_maintenance cribs from the
> documentation for max_parallel_workers_per_gather in saying that we'll
> use fewer workers than expected "which may be inefficient".  However,
> for parallel CREATE INDEX, that trailing clause is, at least as far as
> I can see, not applicable.

Fair point. Will revise.

> (Various points on phrasing and punctuation)

That all seems fine.

> + * To support parallel sort operations involving coordinated callers to
> + * tuplesort.c routines across multiple workers, it is necessary to
> + * concatenate each worker BufFile/tapeset into one single leader-wise
> + * logical tapeset.  Workers should have produced one final materialized
> + * tape (their entire output) when this happens in leader; there will always
> + * be the same number of runs as input tapes, and the same number of input
> + * tapes as workers.
>
> I can't interpret the word "leader-wise".  A partition-wise join is a
> join done one partition at a time, but a leader-wise logical tape set
> is not done one leader at a time.  If there's another meaning to the
> affix -wise, I'm not familiar with it.  Don't we just mean "a single
> logical tapeset managed by the leader"?

Yes, we do. Will change.

> There's a lot here I haven't grokked yet, but I'm running out of
> mental energy so I think I'll send this for now and work on this some
> more when time permits, hopefully tomorrow.

The good news is that the things that you took issue with were about
what I expected you to take issue with. You seem to be getting through
the review of this patch very efficiently.

-- 
Peter Geoghegan


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Wed, Jan 10, 2018 at 1:31 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Can we actually call it max_parallel_maintenance_workers instead?
>> I mean we don't have work_mem_maintenance.
>
> Good point.

WFM.

-- 
Peter Geoghegan


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Robert Haas
Date:
On Wed, Jan 10, 2018 at 5:05 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> How about I remove the comment, but have tuplesort_begin_common()
> force each Tuplesortstate to have workMem that is at least 64KB
> (minimum legal work_mem value) in all cases? We can just formalize the
> existing assumption that workMem cannot go below 64KB, really, and it
> isn't reasonable to use so little workMem within a parallel worker (it
> should be prevented by plan_create_index_workers() in the real world,
> where parallelism is never artificially forced).

+1.  I think this doesn't even need to be documented.  You can simply
write a comment that says something like /* Always allow each worker to use
at least 64kB.  If the amount of memory allowed for the sort is very
small, this might technically cause us to exceed it, but since it's
tiny compared to the overall memory cost of running a worker in the
first place, it shouldn't matter. */

> I share your general feelings on all of this, but I really don't know
> what to do about it. Which of these alternatives is the least worst,
> all things considered?

Let's get the patch committed without any explicit way of forcing the
number of workers and then think about adding that later.

It will be good if you and Rushabh can agree on who will produce the
next version of this patch, and also if I have some idea when that
version should be expected.  On another point, we will need to agree
on how this should be credited in an eventual commit message.  I do
not agree with adding Heikki as an author unless he contributed code,
but we can credit him in some other way, like "Thanks are also due to
Heikki Linnakangas for significant improvements to X, Y, and Z that
made this patch possible."  I assume the author credit will be "Peter
Geoghegan, Rushabh Lathia" in that order, but let me know if anyone
thinks that isn't the right idea.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Alvaro Herrera
Date:
Robert Haas wrote:

> + * To support parallel sort operations involving coordinated callers to
> + * tuplesort.c routines across multiple workers, it is necessary to
> + * concatenate each worker BufFile/tapeset into one single leader-wise
> + * logical tapeset.  Workers should have produced one final materialized
> + * tape (their entire output) when this happens in leader; there will always
> + * be the same number of runs as input tapes, and the same number of input
> + * tapes as workers.
> 
> I can't interpret the word "leader-wise".  A partition-wise join is a
> join done one partition at a time, but a leader-wise logical tape set
> is not done one leader at a time.  If there's another meaning to the
> affix -wise, I'm not familiar with it.  Don't we just mean "a single
> logical tapeset managed by the leader"?

https://www.merriam-webster.com/dictionary/-wise
-wise
adverb combining form
Definition of -wise
1 a : in the manner of (crabwise, fanwise)
b : in the position or direction of (slantwise, clockwise)
2 : with regard to : in respect of (dollarwise)

I think "one at a time" is not the right way to interpret the affix.
Rather, a "partitionwise join" is a join done "in the manner of
partitions", that is, the characteristics of the partitions are
considered when the join is done.

I'm not defending the "leader-wise" term here, though, because I can't
make sense of it, regardless of how I interpret the -wise affix.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Wed, Jan 10, 2018 at 2:21 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I share your general feelings on all of this, but I really don't know
>> what to do about it. Which of these alternatives is the least worst,
>> all things considered?
>
> Let's get the patch committed without any explicit way of forcing the
> number of workers and then think about adding that later.

It could be argued that you need some way of forcing low memory in
workers with any committed version. So while this sounds reasonable,
it might not be compatible with throwing out what I've done with
force_parallel_mode up-front, before you commit anything. What do you
think?

> It will be good if you and Rushabh can agree on who will produce the
> next version of this patch, and also if I have some idea when that
> version should be expected.

I'll take it.

> On another point, we will need to agree
> on how this should be credited in an eventual commit message.  I do
> not agree with adding Heikki as an author unless he contributed code,
> but we can credit him in some other way, like "Thanks are also due to
> Heikki Linnakangas for significant improvements to X, Y, and Z that
> made this patch possible."

I agree that I should have been more nuanced with this. Here's what I intended:

Heikki is not the author of any of the code in the final commit, but
he is morally a (secondary) author of the feature as a whole, and
should be credited as such within the final release notes. This is
justified by the history here, which is that he was involved with the
patch fairly early on, and did some work that was particularly
important to the feature, that almost certainly would not otherwise
have happened. Sure, it helped the serial case too, but much less so.
That's really not why he did it.

> I assume the author credit will be "Peter
> Geoghegan, Rushabh Lathia" in that order, but let me know if anyone
> thinks that isn't the right idea.

"Peter Geoghegan, Rushabh Lathia" seems right. Thomas did write a very
small amount of the actual code, but I think it was more of a review
thing (he is already credited as a reviewer).

-- 
Peter Geoghegan


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Wed, Jan 10, 2018 at 2:36 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> I think "one at a time" is not the right way to interpret the affix.
> Rather, a "partitionwise join" is a join done "in the manner of
> partitions", that is, the characteristics of the partitions are
> considered when the join is done.
>
> I'm not defending the "leader-wise" term here, though, because I can't
> make sense of it, regardless of how I interpret the -wise affix.

I've already conceded the point, but fwiw "leader-wise" comes from the
idea of having a leader-wise space after concatenating the worker
tapes (which have their original/worker-wise space). We must apply an
offset to get from a worker-wise offset to a leader-wise offset.
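
In other words, something like this (sketch; field name assumed):

/*
 * Each concatenated tape remembers the block number at which its
 * worker BufFile begins within the leader's unified space.
 */
leaderwise_blkno = workerwise_blkno + lt->offsetBlockNumber;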

This made more sense in an earlier version. I overlooked this during
recent self review.

-- 
Peter Geoghegan


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Thomas Munro
Date:
On Thu, Jan 11, 2018 at 11:42 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> "Peter Geoghegan, Rushabh Lathia" seems right. Thomas did write a very
> small amount of the actual code, but I think it was more of a review
> thing (he is already credited as a reviewer).

+1

-- 
Thomas Munro
http://www.enterprisedb.com


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Rushabh Lathia
Date:


On Thu, Jan 11, 2018 at 3:35 AM, Peter Geoghegan <pg@bowt.ie> wrote:
On Wed, Jan 10, 2018 at 1:31 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Can we actually call it max_parallel_maintenance_workers instead?
>> I mean we don't have work_mem_maintenance.
>
> Good point.

WFM.


This is a good point. I agree with max_parallel_maintenance_workers.
 
--
Peter Geoghegan



--
Rushabh Lathia

Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Robert Haas
Date:
On Wed, Jan 10, 2018 at 5:42 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Wed, Jan 10, 2018 at 2:21 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> I share your general feelings on all of this, but I really don't know
>>> what to do about it. Which of these alternatives is the least worst,
>>> all things considered?
>>
>> Let's get the patch committed without any explicit way of forcing the
>> number of workers and then think about adding that later.
>
> It could be argued that you need some way of forcing low memory in
> workers with any committed version. So while this sounds reasonable,
> it might not be compatible with throwing out what I've done with
> force_parallel_mode up-front, before you commit anything. What do you
> think?

I think the force_parallel_mode thing is too ugly to live.  I'm not
sure that forcing low memory in workers is a thing we need to have,
but if we do, then we'll have to invent some other way to have it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Thu, Jan 11, 2018 at 11:51 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> I think the force_parallel_mode thing is too ugly to live.  I'm not
> sure that forcing low memory in workers is a thing we need to have,
> but if we do, then we'll have to invent some other way to have it.

It might make sense to have the "minimum memory per participant" value
come from a GUC, rather than be hard coded (it's currently hard-coded
to 32MB). I don't think that it's that compelling as a user-visible
option, but it might make sense as a testing option that we might
very well decide to kill before v11 is released (we might kill it when
we come up with an acceptable interface for "just use this many
workers" in a later commit, which I think we'll definitely end up
doing anyway). By setting the minimum participant memory to 0, you can
then rely on the parallel_workers table storage param to force the
number of worker processes that we'll request. You can accomplish the
same thing with "min_parallel_table_scan_size = 0", of course.

What do you think of that idea?

To be clear, I'm not actually arguing that we need any of this. My
point about being able to test low memory conditions from the first
commit is that insisting on it is reasonable. I don't actually feel
strongly either way, though, and am not doing any insisting myself.

-- 
Peter Geoghegan


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Thu, Jan 11, 2018 at 12:06 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> It might make sense to have the "minimum memory per participant" value
> come from a GUC, rather than be hard coded (it's currently hard-coded
> to 32MB).

> What do you think of that idea?

A third option here is to specifically recognize that
compute_parallel_worker() returned a value based on the table storage
param parallel_workers, and for that reason alone no "insufficient memory
per participant" decrementing/vetoing should take place. That is, when
the parallel_workers param is set, perhaps it should be completely
impossible for CREATE INDEX to ignore it for any reason other than an
inability to launch parallel workers (though that could be due to the
max_parallel_workers GUC's setting).

You could argue that we should do this anyway, I suppose.
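
In code, I imagine something along these lines near the top of
plan_create_index_workers() (sketch only; rel_parallel_workers is the
RelOptInfo field behind the storage param):

/*
 * When the parallel_workers storage param is set, accept it as the
 * number of worker processes to request (still capped by the GUC),
 * and skip the per-participant memory veto entirely.
 */
if (rel->rel_parallel_workers != -1)
    return Min(rel->rel_parallel_workers,
               max_parallel_maintenance_workers);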

-- 
Peter Geoghegan


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Robert Haas
Date:
On Thu, Jan 11, 2018 at 3:25 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Thu, Jan 11, 2018 at 12:06 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>> It might make sense to have the "minimum memory per participant" value
>> come from a GUC, rather than be hard coded (it's currently hard-coded
>> to 32MB).
>
>> What do you think of that idea?
>
> A third option here is to specifically recognize that
> compute_parallel_worker() returned a value based on the table storage
> param parallel_workers, and for that reason alone no "insufficient memory
> per participant" decrementing/vetoing should take place. That is, when
> the parallel_workers param is set, perhaps it should be completely
> impossible for CREATE INDEX to ignore it for any reason other than an
> inability to launch parallel workers (though that could be due to the
> max_parallel_workers GUC's setting).
>
> You could argue that we should do this anyway, I suppose.

Yes, I think this sounds like a good idea.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Thu, Jan 11, 2018 at 1:44 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> A third option here is to specifically recognize that
>> compute_parallel_worker() returned a value based on the table storage
>> param parallel_workers, and for that reason alone no "insufficient
>> memory per participant" decrementing/vetoing should take place. That
>> is, when the parallel_workers param is set, perhaps it should be
>> completely impossible for CREATE INDEX to ignore it for any reason
>> other than an inability to launch parallel workers (though that could
>> be due to the max_parallel_workers GUC's setting).
>>
>> You could argue that we should do this anyway, I suppose.
>
> Yes, I think this sounds like a good idea.

Cool. I've already implemented this in my local working copy of the
patch. That settles that.

If I'm not mistaken, the only outstanding question at this point is
whether we're going to give in and remove parallel leader
participation entirely. I suspect that we won't end up doing that,
because while it's not very useful, it's also not hard to support.
Besides, to some extent that's the expectation that has already been
established.

I am not far from posting a revision that incorporates all of your
feedback. Expect that tomorrow afternoon your time at the latest. Of
course, you may have more feedback for me in the meantime. Let me know
if I should hold off on posting a new version.

-- 
Peter Geoghegan


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Robert Haas
Date:
On Wed, Jan 10, 2018 at 1:45 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> There's a lot here I haven't grokked yet, but I'm running out of
> mental energy so I think I'll send this for now and work on this some
> more when time permits, hopefully tomorrow.

Looking at the logtape changes:

While the patch contains, as I said before, an excellent set of how-to
directions explaining how to use the new parallel sort facilities in
tuplesort.c, there seems to be no such thing for logtape.c, and as a
result I find it a bit unclear how the interface is supposed to work.
I think it would be good to add a similar summary here.

It seems like the words "leader" and "worker" here refer to the leader
of a parallel operation and the associated workers, but do we really
need to make that assumption?  Couldn't we generally describe this as
merging a bunch of 1-tape LogicalTapeSets created from a SharedFileSet
into a single LogicalTapeSet that can thereafter be read by the
process that does the merging?

+    /* Pass worker BufFile pieces, and a placeholder leader piece */
+    for (i = 0; i < lts->nTapes; i++)
+    {
+        lt = &lts->tapes[i];
+
+        /*
+         * Build concatenated view of all BufFiles, remembering the block
+         * number where each source file begins.
+         */
+        if (i < lts->nTapes - 1)

Unless I'm missing something, the "if" condition just causes the last
pass through this loop to do nothing.  If so, why not just change the
loop condition to i < lts->nTapes - 1 and drop the "if" statement
altogether?

+            char        filename[MAXPGPATH] = {0};

I don't think you need = {0}, because pg_itoa is about to clobber it anyway.

+            /* Alter worker's tape state (generic values okay for leader) */

What do you mean by generic values?

+ * Each tape is initialized in write state.  Serial callers pass ntapes, but
+ * NULL arguments for everything else.  Parallel worker callers pass a
+ * shared handle and worker number, but tapeset should be NULL.  Leader
+ * passes worker -1, a shared handle, and shared tape metadata. These are
+ * used to claim ownership of worker tapes.

This comment doesn't match the actual function definition terribly
well.  Serial callers don't pass NULL for "everything else", because
"int worker" is not going to be NULL.  For parallel workers, it's not
entirely obvious whether "a shared handle" means TapeShare *tapes or
SharedFileSet *fileset.  "tapeset" sounds like an argument name, but
there is no such argument.

lt->max_size looks like it might be an optimization separate from the
overall patch, but maybe I'm wrong about that.

+        /* palloc() larger than MaxAllocSize would fail */
         lt->buffer = NULL;
         lt->buffer_size = 0;
+        lt->max_size = MaxAllocSize;

The comment about palloc() should move down to where you assign max_size.

Generally we avoid returning a struct type, so maybe
LogicalTapeFreeze() should instead grow an out parameter of type
TapeShare * which it populates only if not NULL.

Won't LogicalTapeFreeze() fail an assertion in BufFileExportShared()
if the file doesn't belong to a shared fileset?  If you adopt the
previous suggestion, we can probably just make whether to call this
contingent on whether the TapeShare * out parameter is provided.
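
In other words, something like this (sketch):

extern void LogicalTapeFreeze(LogicalTapeSet *lts, int tapenum,
                              TapeShare *share);

...with the export step done only on request:

if (share != NULL)
{
    /* only valid when the underlying file is in a shared fileset */
    BufFileExportShared(lts->pfile);
    share->firstblocknumber = lt->firstBlockNumber;
}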

I'm not confident I completely understand what's going on with the
logtape stuff yet, so I might have more comments (or better ones)
after I study this further.  To your question about whether to go
ahead and post a new version, I'm OK to keep reviewing this version
for a little longer or to switch to a new one, as you prefer.  I have
not made any local changes, just written a blizzard of email text.
:-p

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Thu, Jan 11, 2018 at 2:26 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> While the patch contains, as I said before, an excellent set of how-to
> directions explaining how to use the new parallel sort facilities in
> tuplesort.c, there seems to be no such thing for logtape.c, and as a
> result I find it a bit unclear how the interface is supposed to work.
> I think it would be good to add a similar summary here.

Okay. I came up with something for that.

> It seems like the words "leader" and "worker" here refer to the leader
> of a parallel operation and the associated workers, but do we really
> need to make that assumption?  Couldn't we generally describe this as
> merging a bunch of 1-tape LogicalTapeSets created from a SharedFileSet
> into a single LogicalTapeSet that can thereafter be read by the
> process that does the merging?

It's not so much an assumption as it is the most direct way of
referring to these various objects. logtape.c is very clearly a
submodule of tuplesort.c, so this felt okay to me. There are already
several references to what tuplesort.c expects. I'm not going to argue
about it if you insist on this, though I do think that trying to
describe things in more general terms would be a net loss. It would
kind of come off as feigning ignorance IMV. The names/roles are about
the only things logtape.c could pretend not to know, and I find it
hard to imagine those changing, even when we add support for
partitioning/distribution sort (where logtape.c handles
"redistribution", something discussed early in this project's
lifetime).

> +    /* Pass worker BufFile pieces, and a placeholder leader piece */
> +    for (i = 0; i < lts->nTapes; i++)
> +    {
> +        lt = &lts->tapes[i];
> +
> +        /*
> +         * Build concatenated view of all BufFiles, remembering the block
> +         * number where each source file begins.
> +         */
> +        if (i < lts->nTapes - 1)
>
> Unless I'm missing something, the "if" condition just causes the last
> pass through this loop to do nothing.  If so, why not just change the
> loop condition to i < lts->nTapes - 1 and drop the "if" statement
> altogether?

The last "lt" in the loop is in fact used separately, just outside the
loop. But that use turns out to have been subtly wrong, apparently due
to a problem with converting logtape.c to use the shared buffile
stuff. This buglet would only have caused writing to the leader tape
to break (never the trace_sort instrumentation), something that isn't
supported anyway due to the restrictions that shared BufFiles have.
But, we should, on general principle, be able to write to the leader
tape if and when shared buffiles learn to support writing (after
exporting original BufFile in worker).

Buglet fixed in my local working copy. I did so in a way that changes
the loop test along the lines you suggest. This should make the whole
design of tape concatenation a bit clearer.
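
To spell that out, the concatenation loop now looks roughly like this
(a sketch, not the exact patch code):

for (i = 0; i < lts->nTapes - 1; i++)
{
    lt = &lts->tapes[i];
    /* concatenate this worker tape's BufFile, noting its start block */
}

/* The leader tape is the final tape, dealt with separately: */
lt = &lts->tapes[lts->nTapes - 1];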

> +            char        filename[MAXPGPATH] = {0};
>
> I don't think you need = {0}, because pg_itoa is about to clobber it anyway.

Okay.

> +            /* Alter worker's tape state (generic values okay for leader) */
>
> What do you mean by generic values?

I mean that the leader's tape doesn't need to have
lt->firstBlockNumber set, because it's empty -- it can remain -1. Same
applies to lt->offsetBlockNumber, too.

I'll remove the text within parentheses, since it seems redundant
given the structure of the loop.

> + * Each tape is initialized in write state.  Serial callers pass ntapes, but
> + * NULL arguments for everything else.  Parallel worker callers pass a
> + * shared handle and worker number, but tapeset should be NULL.  Leader
> + * passes worker -1, a shared handle, and shared tape metadata. These are
> + * used to claim ownership of worker tapes.
>
> This comment doesn't match the actual function definition terribly
> well.  Serial callers don't pass NULL for "everything else", because
> "int worker" is not going to be NULL.  For parallel workers, it's not
> entirely obvious whether "a shared handle" means TapeShare *tapes or
> SharedFileSet *fileset.  "tapeset" sounds like an argument name, but
> there is no such argument.

Okay. I've tweaked things here.

> lt->max_size looks like it might be an optimization separate from the
> overall patch, but maybe I'm wrong about that.

I think that it's pretty much essential. Currently, the MaxAllocSize
restriction is needed in logtape.c for the same reason that it's
needed anywhere else. Not much to talk about there. The new max_size
thing is about more than that, though -- it's really about not
stupidly allocating up to a full MaxAllocSize when you already know
that you're going to use next to no memory.

You don't have this issue with serial sorts because serial sorts that
only sort a tiny number of tuples never end up as external sorts --
when you end up doing a serial external sort, clearly you're never
going to allocate an excessive amount of memory up front in logtape.c,
because you are by definition operating in a memory constrained
fashion. Not so for parallel external tuplesorts. Think spool2 in a
parallel unique index build, in the case where there are next to no
recently dead tuples (the common case).

> +        /* palloc() larger than MaxAllocSize would fail */
>          lt->buffer = NULL;
>          lt->buffer_size = 0;
> +        lt->max_size = MaxAllocSize;
>
> The comment about palloc() should move down to where you assign max_size.

Okay.

> Generally we avoid returning a struct type, so maybe
> LogicalTapeFreeze() should instead grow an out parameter of type
> TapeShare * which it populates only if not NULL.

Okay. I've modified LogicalTapeFreeze(), adding a "share" output
argument and reverting to returning void, as before.

> Won't LogicalTapeFreeze() fail an assertion in BufFileExportShared()
> if the file doesn't belong to a shared fileset?  If you adopt the
> previous suggestion, we can probably just make whether to call this
> contingent on whether the TapeShare * out parameter is provided.

Oops, you're right. It will be taken care of by the
LogicalTapeFreeze() signature change you suggested.

> I'm not confident I completely understand what's going on with the
> logtape stuff yet, so I might have more comments (or better ones)
> after I study this further.  To your question about whether to go
> ahead and post a new version, I'm OK to keep reviewing this version
> for a little longer or to switch to a new one, as you prefer.  I have
> not made any local changes, just written a blizzard of email text.
> :-p

Great. Thanks.

I've caught up with you again. I just need to take a look at what I
came up with with fresh eyes, and maybe do some more testing.

-- 
Peter Geoghegan


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Amit Kapila
Date:
On Sat, Jan 6, 2018 at 3:47 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Tue, Jan 2, 2018 at 8:43 PM, Rushabh Lathia <rushabh.lathia@gmail.com> wrote:
>> I agree that plan_create_index_workers() needs to count the leader as a
>> normal worker for the CREATE INDEX.  So what you are proposing is:
>> when parallel_leader_participation is true, launch (return value of
>> compute_parallel_worker() - 1) workers.  True?
>
> Almost. We need to not subtract one when only one worker is indicated
> by compute_parallel_worker(). I also added some new stuff there, to
> consider edge cases with the parallel_leader_participation GUC.
>
>>> I'm working on fixing up what you posted. I'm probably not more than a
>>> week away from posting a patch that I'm going to mark "ready for
>>> committer". I've already made the change above, and once I spend time
>>> on trying to break the few small changes needed within buffile.c I'll
>>> have taken it as far as I can, most likely.
>>>
>>
>> Okay, once you submit the patch with changes - I will do one round of
>> review for the changes.
>
> I've attached my revision. Changes include:
>

Few observations while skimming through the patch:

1.
+ if (!IsBootstrapProcessingMode() && !indexInfo->ii_Concurrent)
  {
- snapshot = RegisterSnapshot(GetTransactionSnapshot());
- OldestXmin = InvalidTransactionId; /* not used */
+ OldestXmin = GetOldestXmin(heapRelation, true);

I think the leader and workers should have the same idea of oldestXmin
for the purpose of deciding the visibility of tuples.  I think this is
ensured in all forms of parallel query, as we do share the snapshot;
however, the same doesn't seem to be true for Parallel Index builds.

2.
+
+ /* Wait on worker processes to finish (should be almost instant) */
+ reltuples = _bt_leader_wait_for_workers(buildstate);

Can't we use WaitForParallelWorkersToFinish for this purpose?  The
reason is that if we use a different mechanism here then we might need
a different way to solve the problem related to fork failure.  See
thread [1].  Basically, if the postmaster fails to launch workers due
to fork failure, the leader backend might wait indefinitely.
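
Something like the following, perhaps (a sketch; the shared-memory
counter array is hypothetical, and field names are approximate):

/* Wait using the standard primitive... */
WaitForParallelWorkersToFinish(buildstate->btleader->pcxt);

/* ...then collect per-worker results from shared memory */
for (i = 0; i < pcxt->nworkers_launched; i++)
    reltuples += btshared->workers_reltuples[i];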



[1] - https://commitfest.postgresql.org/16/1341/


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Robert Haas
Date:
On Fri, Jan 12, 2018 at 8:19 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> 1.
> + if (!IsBootstrapProcessingMode() && !indexInfo->ii_Concurrent)
>   {
> - snapshot = RegisterSnapshot(GetTransactionSnapshot());
> - OldestXmin = InvalidTransactionId; /* not used */
> + OldestXmin = GetOldestXmin(heapRelation, true);
>
> I think the leader and workers should have the same idea of oldestXmin
> for the purpose of deciding the visibility of tuples.  I think this is
> ensured in all forms of parallel query, as we do share the snapshot;
> however, the same doesn't seem to be true for Parallel Index builds.

Hmm.  Does it break anything if they use different snapshots?  In the
case of a query that would be disastrous because then you might get
inconsistent results, but if the snapshot is only being used to
determine what is and is not dead then I'm not sure it makes much
difference ... unless the different snapshots will create confusion of
some other sort.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Robert Haas
Date:
On Thu, Jan 11, 2018 at 8:58 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> I've caught up with you again. I just need to take a look at what I
> came up with with fresh eyes, and maybe do some more testing.

More comments:

BufFileView() looks fairly pointless.  It basically creates a copy of
the input and, in so doing, destroys the input, which is a lot like
returning the input parameter except that it uses more cycles.  It
does do a few things.  First, it zeroes the offsets array instead of
copying the offsets.  But as used, those offsets would have been 0
anyway.  Second, it sets the fileset parameter to NULL.   But that
doesn't actually seem to be important for anything: the fileset is
only used when creating new files, and the BufFile must already be
marked read-only, so we won't be doing that.  It seems like this
function can just be entirely removed and replaced by Assert()-ing
some things about the target in BufFileViewAppend, which I would just
rename to BufFileAppend.

In miscadmin.h, I'd put the prototype for the new GUC next to
max_worker_processes, not maintenance_work_mem.

The ereport() in index_build will, I think, confuse people when it
says that there are 0 parallel workers.  I suggest splitting this into
two cases: if (indexInfo->ii_ParallelWorkers == 0) ereport(...
"building index \"%s\" on table \"%s\" serially" ...) else ereport(...
"building index \"%s\" on table \"%s\" in parallel with request for %d
parallel workers" ...).  Might even need three cases to handle
parallel_leader_participation without needing to assemble the message,
unless we drop parallel_leader_participation support.
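
That is, roughly the following (a sketch; message wording and log
level approximate):

if (indexInfo->ii_ParallelWorkers == 0)
    ereport(DEBUG1,
            (errmsg("building index \"%s\" on table \"%s\" serially",
                    RelationGetRelationName(indexRelation),
                    RelationGetRelationName(heapRelation))));
else
    ereport(DEBUG1,
            (errmsg("building index \"%s\" on table \"%s\" in parallel "
                    "with request for %d parallel workers",
                    RelationGetRelationName(indexRelation),
                    RelationGetRelationName(heapRelation),
                    indexInfo->ii_ParallelWorkers)));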

The logic in IndexBuildHeapRangeScan() around need_register_snapshot
and OldestXmin seems convoluted and not very well-edited to me.  For
example, need_register_snapshot is set to false in a block that is
only entered when it's already false, and the comment that follows is
supposed to be associated with GetOldestXmin() and makes no sense
here.  I suggest that you go back to the original code organization
and then just insert an additional case for a caller-supplied scan, so
that the overall flow looks like this:

if (scan != NULL)
{
   ...
}
else if (IsBootstrapProcessingMode() || indexInfo->ii_Concurrent)
{
   ...
}
else
{
   ...
}

Along with that, I'd change the name of need_register_snapshot to
need_unregister_snapshot (it's doing both jobs right now) and
initialize it to false.  If you enter the second of the above blocks
then change it to true just after snapshot =
RegisterSnapshot(GetTransactionSnapshot()).  Then adjust the comment
that begins "Prepare for scan of the base relation." by inserting an
additional sentence just after that one: "If the caller has supplied a
scan, just use it.  Otherwise, in a normal index build..." and the
rest as it is currently.
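
Putting those pieces together, the flow would look roughly like this
(a sketch; the caller-supplied-scan branch is approximate):

if (scan != NULL)
{
    /* Caller-supplied (parallel) scan: just use it */
    snapshot = scan->rs_snapshot;
}
else if (IsBootstrapProcessingMode() || indexInfo->ii_Concurrent)
{
    snapshot = RegisterSnapshot(GetTransactionSnapshot());
    need_unregister_snapshot = true;
    OldestXmin = InvalidTransactionId;    /* not used */
}
else
{
    snapshot = SnapshotAny;
    OldestXmin = GetOldestXmin(heapRelation, PROCARRAY_FLAGS_VACUUM);
}

/* ... scan the heap ... */

if (need_unregister_snapshot)
    UnregisterSnapshot(snapshot);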

+ * This support code isn't reliable when called from within a parallel
+ * worker process due to the fact that our state isn't propagated.  This is
+ * why parallel index builds are disallowed on catalogs.  It is possible
+ * that we'll fail to catch an attempted use of a user index undergoing
+ * reindexing due the non-propagation of this state to workers, which is not
+ * ideal, but the problem is not particularly likely to go undetected due to
+ * our not doing better there.

I understand the first two sentences, but I have no idea what the
third one means, especially the part that says "not particularly
likely to go undetected due to our not doing better there".  It sounds
scary that something bad is only "not particularly likely to go
undetected"; don't we need to detect bad things reliably? But also,
you used the word "not" three times and also the prefix "un-", meaning
"not", once.  Four negations in 13 words! Perhaps I'm not entirely in
a position to cast aspersions on overly-complex phraseology -- the pot
calling the kettle black and all that -- but I bet that will be a lot
clearer if you reduce the number of negations to either 0 or 1.

The comment change in standard_planner() doesn't look helpful to me;
I'd leave it out.

+ * tableOid is the table that index is to be built on.  indexOid is the OID
+ * of a index to be created or reindexed (which must be a btree index).

I'd rewrite that first sentence to end "the table on which the index
is to be built".  The second sentence should say "an index" rather
than "a index".

+ * leaderWorker indicates whether leader will participate as worker or not.
+ * This needs to be taken into account because leader process is guaranteed to
+ * be idle when not participating as a worker, in contrast with conventional
+ * parallel relation scans, where the leader process typically has plenty of
+ * other work to do relating to coordinating the scan, etc.  For CREATE INDEX,
+ * leader is usually treated as just another participant for our scaling
+ * calculation.

OK, I get the first sentence.  But the rest of this appears to be
partially irrelevant and partially incorrect.  The degree to which the
leader is likely to be otherwise occupied isn't very relevant; as long
as we think it's going to do anything at all, we have to account for
it somehow.  Also, the issue isn't that in a query the leader would be
busy "coordinating the scan, etc." but rather that it would have to
read the tuples produced by the Gather (Merge) node.  I think you
could just delete everything from "This needs to be..." through the
end.  You can cover the details of how it's used closer to the point
where you do anything with leaderWorker (or, as I assume it will soon
be, leaderParticipates).

But, actually, I think we would be better off just ripping
leaderWorker/leaderParticipates out of this function altogether.
compute_parallel_worker() is not really under any illusion that it's
computing a number of *participants*; it's just computing a number of
*workers*.  Deducting 1 when the leader is also participating but only
when at least 2 workers were computed leads to an oddity: for a
regular parallel sequential scan, the number of workers increases by 1
when the table size increases by a factor of 3, but here, the number
of workers increases from 1 to 2 when the table size increases by a
factor of 9, and then by 1 for every further multiple of 3.  There
doesn't seem to be any theoretical or practical justification for such
behavior, or for being inconsistent with what parallel sequential
scan otherwise does.  I think it's fine for
parallel_leader_participation=off to simply mean that you get one
fewer participants.  That's actually what would happen with parallel
query, too.  Parallel query would consider
parallel_leader_participation later, in get_parallel_divisor(), when
working out the cost of one path vs. another, but it doesn't use it to
choose the number of workers.  So it seems to me that getting rid of
all of the workerLeader considerations will make it both simpler and
more consistent with what we do for queries.

To be clear, I don't think there's any real need for the cost model we
choose for CREATE INDEX to be the same as the one we use for regular
scans.  The problem with regular scans is that it's very hard to
predict how many workers we can usefully use; it depends not only on
the table size but on what plan nodes get stacked on top of it higher
in the plan tree.  In a perfect world we'd like to add as many workers
as required to avoid having the query be I/O bound and then stop, but
that would require both the ability to predict future system
utilization and a heck of a lot more knowledge than the planner can
hope to have at this point.  If you have an idea how to make a better
cost model than this for CREATE INDEX, I'm willing to consider other
options.  If you don't, or want to propose that as a follow-up patch,
then I think it's OK to use what you've got here for starters.  I just
don't want it to be more baroque than necessary.

I think that the naming of the wait events could be improved.  Right
now, they are named by which kind of process does the waiting, but they
really should be named based on the thing for which we're waiting.  I
also suggest that we could just write Sort instead of Tuplesort.  In
short, I suggest ParallelTuplesortLeader -> ParallelSortWorkersDone and
ParallelTuplesortWorker -> ParallelSortTapeHandover.

Not for this patch, but I wonder if it might be a worthwhile future
optimization to allow workers to return multiple tapes to the leader.
One doesn't want to go crazy with this, of course.  If the worker
returns 100 tapes, then the leader might get stuck doing multiple
merge passes, which would be a foolish way to divide up the labor, and
even if that doesn't happen, Amdahl's law argues for minimizing the
amount of work that is not done in parallel.  Still, what if a worker
(perhaps after merging) ends up with 2 or 3 tapes?  Is it really worth
merging them so that the leader can do a 5-way merge instead of a
15-way merge?  Maybe this case is rare in practice, because multiple
merge passes will be uncommon with reasonable values of work_mem, and
it might be silly to go to the trouble of firing up workers if they'll
only generate a few runs in total.  Just a thought.

+     * Make sure that the temp file(s) underlying the tape set are created in
+     * suitable temp tablespaces.  This is only really needed for serial
+     * sorts.

This comment makes me wonder whether it is "sorta" needed for parallel sorts.

-    if (trace_sort)
+    if (trace_sort && !WORKER(state))

I have a feeling we still want to get this output even from workers,
but maybe I'm missing something.

+      arg5 indicates serial, parallel worker, or parallel leader sort.</entry>

I think it should say what values are used for each case.

+    /* Release worker tuplesorts within leader process as soon as possible */

IIUC, the worker tuplesorts aren't really holding onto much of
anything in terms of resources.  I think it might be better to phrase
this as /* The sort we just did absorbed the final tapes produced by
these tuplesorts, which are of no further use. */ or words to that
effect.

Instead of making a special case in CreateParallelContext for
serializable_okay, maybe index_build should just use SetConfigOption()
to force the isolation level to READ COMMITTED right after it does
NewGUCNestLevel().  The change would only be temporary because the
subsequent call to AtEOXact_GUC() will revert it.  The point isn't
really that CREATE INDEX is somehow exempt from the problem that
SIREAD locks haven't been updated to work correctly with parallelism;
it's that CREATE INDEX itself is defined to ignore serializability
concerns.
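
In code, that suggestion amounts to something like this (a sketch;
assuming "transaction_isolation" can be overridden at this point):

int        save_nestlevel = NewGUCNestLevel();

/* CREATE INDEX is defined to ignore serializability concerns */
SetConfigOption("transaction_isolation", "read committed",
                PGC_SUSET, PGC_S_OVERRIDE);

/* ... build the index ... */

AtEOXact_GUC(false, save_nestlevel);    /* reverts the override */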

There is *still* more to review here, but my concentration is fading.
If you could post an updated patch after adjusting for the comments
above, I think that would be helpful.  I'm not totally out of things
to review that I haven't already looked over once, but I think I'm
close.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Fri, Jan 12, 2018 at 6:14 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Jan 12, 2018 at 8:19 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> 1.
>> + if (!IsBootstrapProcessingMode() && !indexInfo->ii_Concurrent)
>>   {
>> - snapshot = RegisterSnapshot(GetTransactionSnapshot());
>> - OldestXmin = InvalidTransactionId; /* not used */
>> + OldestXmin = GetOldestXmin(heapRelation, true);
>>
>> I think the leader and workers should have the same idea of oldestXmin
>> for the purpose of deciding the visibility of tuples.  I think this is
>> ensured in all forms of parallel query, as we do share the snapshot;
>> however, the same doesn't seem to be true for Parallel Index builds.
>
> Hmm.  Does it break anything if they use different snapshots?  In the
> case of a query that would be disastrous because then you might get
> inconsistent results, but if the snapshot is only being used to
> determine what is and is not dead then I'm not sure it makes much
> difference ... unless the different snapshots will create confusion of
> some other sort.

I think that this is fine. GetOldestXmin() is only used when we have a
ShareLock on the heap relation, and the snapshot is SnapshotAny. We're
only talking about the difference between HEAPTUPLE_DEAD and
HEAPTUPLE_RECENTLY_DEAD here. Indexing a heap tuple when that wasn't
strictly necessary by the time you got to it is normal.

However, it's not okay that GetOldestXmin()'s second argument is true
in the patch, rather than PROCARRAY_FLAGS_VACUUM. That's due to bitrot
that was not caught during some previous rebase (commit af4b1a08
changed the signature). Will fix.

You've given me a lot more to work through in your most recent mail,
Robert. I will probably get the next revision to you on Monday.
Doesn't seem like there is much point in posting what I've done so
far.

-- 
Peter Geoghegan


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Amit Kapila
Date:
On Sat, Jan 13, 2018 at 1:25 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Fri, Jan 12, 2018 at 6:14 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Fri, Jan 12, 2018 at 8:19 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> 1.
>>> + if (!IsBootstrapProcessingMode() && !indexInfo->ii_Concurrent)
>>>   {
>>> - snapshot = RegisterSnapshot(GetTransactionSnapshot());
>>> - OldestXmin = InvalidTransactionId; /* not used */
>>> + OldestXmin = GetOldestXmin(heapRelation, true);
>>>
>>> I think the leader and workers should have the same idea of oldestXmin
>>> for the purpose of deciding the visibility of tuples.  I think this is
>>> ensured in all forms of parallel query, as we do share the snapshot;
>>> however, the same doesn't seem to be true for Parallel Index builds.
>>
>> Hmm.  Does it break anything if they use different snapshots?  In the
>> case of a query that would be disastrous because then you might get
>> inconsistent results, but if the snapshot is only being used to
>> determine what is and is not dead then I'm not sure it makes much
>> difference ... unless the different snapshots will create confusion of
>> some other sort.
>
> I think that this is fine. GetOldestXmin() is only used when we have a
> ShareLock on the heap relation, and the snapshot is SnapshotAny. We're
> only talking about the difference between HEAPTUPLE_DEAD and
> HEAPTUPLE_RECENTLY_DEAD here. Indexing a heap tuple when that wasn't
> strictly necessary by the time you got to it is normal.
>

Yeah, but this would mean that now with parallel create index, it is
possible that some tuples from the transaction would end up in the
index and others won't.  In general, this makes me slightly nervous,
mainly because such a case won't be possible without the parallel
option for create index, but if you and Robert are okay with it, as
there is no fundamental problem, then we might as well leave it as it
is or maybe add a comment saying so.


Another point is that the information about broken hot chains
indexInfo->ii_BrokenHotChain is getting lost.  I think you need to
coordinate this information among backends that participate in
parallel create index.  A test to reproduce the problem is as below:

create table tbrokenchain(c1 int, c2 varchar);
insert into tbrokenchain values(3, 'aaa');

begin;
set force_parallel_mode=on;
update tbrokenchain set c2 = 'bbb' where c1=3;
create index idx_tbrokenchain on tbrokenchain(c1);
commit;

Now, check the value of indcheckxmin in pg_index: it should be true,
but with the patch it is false.  You can try this with the patch by not
changing the value of force_parallel_mode.

The patch uses both parallel_leader_participation and
force_parallel_mode, but it seems the definition is different from
what we have in Gather.  Basically, even with force_parallel_mode, the
leader is participating in the parallel build.  I see there is some
discussion above about both of these parameters, and still there is no
complete agreement on the best way forward.  I think we should have
parallel_leader_participation, as that can help in testing if nothing
else.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Amit Kapila
Date:
On Sat, Jan 13, 2018 at 6:02 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> The patch uses both parallel_leader_participation and
> force_parallel_mode, but it seems the definition is different from
> what we have in Gather.  Basically, even with force_parallel_mode, the
> leader is participating in the parallel build.  I see there is some
> discussion above about both of these parameters, and still there is no
> complete agreement on the best way forward.  I think we should have
> parallel_leader_participation, as that can help in testing if nothing
> else.
>

Or maybe just have force_parallel_mode.  I think one of these is
required to facilitate easy testing of the parallel code.  As you can
see from my previous email, it was quite easy to
demonstrate a test with force_parallel_mode.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Sat, Jan 13, 2018 at 4:32 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Yeah, but this would mean that now with parallel create index, it is
> possible that some tuples from the transaction would end up in the
> index and others won't.

You mean some tuples from some past transaction that deleted a bunch
of tuples and committed, but not before someone acquired a still-held
snapshot that didn't see the deleter's transaction as committed yet?

I guess that that is different, but it doesn't matter. All that
matters is that in the end, the index contains all entries for all
heap tuples visible to any possible snapshot (though possibly
excluding some existing old snapshots iff we detect broken HOT chains
during builds).

> In general, this makes me slightly nervous mainly
> because such a case won't be possible without the parallel option for
> create index, but if you and Robert are okay with it as there is no
> fundamental problem, then we might as well leave it as it is or maybe
> add a comment saying so.

Let me try to explain this another way, in terms of the high-level
intuition that I have about it (Robert can probably skip this part).

GetOldestXmin() returns a value that is inherently a *conservative*
cut-off. In hot standby mode, it's possible for the value it returns
to go backwards from a value previously returned within the same
backend.

Even with serial builds, the exact instant that GetOldestXmin() gets
called could vary based on something like the OS scheduling of the
process that runs CREATE INDEX. It could have a different value based
only on that. It follows that it won't matter if parallel CREATE INDEX
participants have a slightly different value, because the cut-off is
all about the consistency of the index with what the universe of
possible snapshots could see in the heap, not the consistency of
different parts of the index with each other (the parts produced from
heap tuples read from each participant).

Look at how the pg_visibility module calls GetOldestXmin() to recheck
-- it has to call GetOldestXmin() a second time, with a buffer lock
held on a heap page throughout. It does this to conclusively establish
that the visibility map is corrupt (otherwise, it could just be that
the cut-off became stale).

Putting all of this together, it would be safe for the
HEAPTUPLE_RECENTLY_DEAD case within IndexBuildHeapRangeScan() to call
GetOldestXmin() again (a bit like pg_visibility does), to avoid having
to index an actually-fully-dead-by-now tuple (we could call
HeapTupleSatisfiesVacuum() a second time for the heap tuple, hoping to
get HEAPTUPLE_DEAD the second time around). This optimization wouldn't
work out a lot of the time (it would only work out when an old
snapshot went away during the CREATE INDEX), and would add
ProcArrayLock traffic, so we don't do it. But AFAICT it's feasible.
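
Spelled out, that unimplemented recheck would look something like this
within IndexBuildHeapRangeScan()'s visibility switch (sketch):

case HEAPTUPLE_RECENTLY_DEAD:
    /* Recompute the cut-off, a bit like pg_visibility does */
    OldestXmin = GetOldestXmin(heapRelation, PROCARRAY_FLAGS_VACUUM);
    if (HeapTupleSatisfiesVacuum(heapTuple, OldestXmin, scan->rs_cbuf) ==
        HEAPTUPLE_DEAD)
    {
        /* An old snapshot went away; no need to index this one */
        indexIt = false;
        break;
    }
    /* otherwise, existing RECENTLY_DEAD handling follows */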

> Another point is that the information about broken hot chains
> indexInfo->ii_BrokenHotChain is getting lost.  I think you need to
> coordinate this information among backends that participate in
> parallel create index.  A test to reproduce the problem is as below:
>
> create table tbrokenchain(c1 int, c2 varchar);
> insert into tbrokenchain values(3, 'aaa');
>
> begin;
> set force_parallel_mode=on;
> update tbrokenchain set c2 = 'bbb' where c1=3;
> create index idx_tbrokenchain on tbrokenchain(c1);
> commit;
>
> Now, check the value of indcheckxmin in pg_index: it should be true,
> but with the patch it is false.  You can try this with the patch by not
> changing the value of force_parallel_mode.

Ugh, you're right. That's a real howler. Will fix.

Note that my stress-testing strategy has had a lot to do with
verifying that a serial build produces relfiles that are physically
identical to those of parallel builds. Obviously that couldn't have caught
this, because this only concerns the state of the pg_index catalog.

> The patch uses both parallel_leader_participation and
> force_parallel_mode, but it seems the definition is different from
> what we have in Gather.  Basically, even with force_parallel_mode, the
> leader is participating in the parallel build.  I see there is some
> discussion above about both of these parameters, and still there is no
> complete agreement on the best way forward.  I think we should have
> parallel_leader_participation, as that can help in testing if nothing
> else.

I think that you're quite right that parallel_leader_participation
needs to be supported for testing purposes. I had some sympathy for
the idea that we should remove leader participation as a worker from
the patch entirely, but the testing argument seems to clinch it. I'm
fine with killing force_parallel_mode, though, because it will be
possible to force the use of parallelism by using the existing
parallel_workers table storage param in the next version of the patch,
regardless of how small the table is.

Thanks for the review.
-- 
Peter Geoghegan


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Amit Kapila
Date:
On Sun, Jan 14, 2018 at 1:43 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Sat, Jan 13, 2018 at 4:32 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Yeah, but this would mean that now with parallel create index, it is
>> possible that some tuples from the transaction would end up in the
>> index and others won't.
>
> You mean some tuples from some past transaction that deleted a bunch
> of tuples and committed, but not before someone acquired a still-held
> snapshot that didn't see the deleter's transaction as committed yet?
>

I think I am talking about something different.  Let me try to explain
in some more detail.  Consider that a transaction T-1 has deleted two
tuples from tab-1, the first on page-1 and the second on page-2, and
committed.  There is a concurrent transaction T-2 which has an open
snapshot/query due to which oldestXmin will be smaller than T-1.  Now,
in another session, we start a parallel Create Index on tab-1 which
launches one worker.  The worker decides to scan page-1 and will find
that the deleted tuple on page-1 is Recently Dead, so it will include
it in the index.  In the meantime, transaction T-2 gets
committed/aborted, which allows oldestXmin to become greater than the
xid of transaction T-1, and now the leader decides to scan page-2 with
a freshly computed oldestXmin, finds that the tuple on that page is
Dead, and decides not to include it in the index.  So, this leads to a
situation where some tuples deleted by the transaction will end up in
the index whereas others won't.  Note that I am not arguing that there
is any fundamental problem with this, but just want to highlight that
such a case doesn't seem to exist with serial Create Index.

>
>> The patch uses both parallel_leader_participation and
>> force_parallel_mode, but it seems the definition is different from
>> what we have in Gather.  Basically, even with force_parallel_mode, the
>> leader is participating in the parallel build.  I see there is some
>> discussion above about both of these parameters, and still there is no
>> complete agreement on the best way forward.  I think we should have
>> parallel_leader_participation, as that can help in testing if nothing
>> else.
>
> I think that you're quite right that parallel_leader_participation
> needs to be supported for testing purposes. I had some sympathy for
> the idea that we should remove leader participation as a worker from
> the patch entirely, but the testing argument seems to clinch it. I'm
> fine with killing force_parallel_mode, though, because it will be
> possible to force the use of parallelism by using the existing
> parallel_workers table storage param in the next version of the patch,
> regardless of how small the table is.
>

Okay, this makes sense to me.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Sun, Jan 14, 2018 at 8:25 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Sun, Jan 14, 2018 at 1:43 AM, Peter Geoghegan <pg@bowt.ie> wrote:
>> On Sat, Jan 13, 2018 at 4:32 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> Yeah, but this would mean that now with parallel create index, it is
>>> possible that some tuples from the transaction would end up in the
>>> index and others won't.
>>
>> You mean some tuples from some past transaction that deleted a bunch
>> of tuples and committed, but not before someone acquired a still-held
>> snapshot that didn't see the deleter's transaction as committed yet?
>>
>
> I think I am talking about something different.  Let me try to explain
> in some more detail.  Consider that a transaction T-1 has deleted two
> tuples from tab-1, the first on page-1 and the second on page-2, and
> committed.  There is a concurrent transaction T-2 which has an open
> snapshot/query due to which oldestXmin will be smaller than T-1.  Now,
> in another session, we start a parallel Create Index on tab-1 which
> launches one worker.  The worker decides to scan page-1 and will find
> that the deleted tuple on page-1 is Recently Dead, so it will include
> it in the index.  In the meantime, transaction T-2 gets
> committed/aborted, which allows oldestXmin to become greater than the
> xid of transaction T-1, and now the leader decides to scan page-2 with
> a freshly computed oldestXmin, finds that the tuple on that page is
> Dead, and decides not to include it in the index.  So, this leads to a
> situation where some tuples deleted by the transaction will end up in
> the index whereas others won't.  Note that I am not arguing that there
> is any fundamental problem with this, but just want to highlight that
> such a case doesn't seem to exist with serial Create Index.

I must have not done a good job of explaining myself ("You mean some
tuples from some past transaction..."), because this is exactly what I
meant, and was exactly how I understood your original remarks from
Saturday.

In summary, while I do agree that this is different to what we see
with serial index builds, I still don't think that this is a concern
for us.

-- 
Peter Geoghegan


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Fri, Jan 12, 2018 at 10:28 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> More comments:

Attached patch has all open issues worked through, including those
that I respond to or comment on below, as well as the other feedback
from your previous e-mails. Note also that I fixed the issue that Amit
raised, as well as the GetOldestXmin()-argument bug that I noticed in
passing when responding to Amit. I also worked on the attribution in
the commit message.

Before getting to my responses to your most recent round of feedback,
I want to first talk about some refactoring that I decided to do. As
you can see from the master branch, tuplesort_performsort() isn't
necessarily reached for spool2, even when we start out with a spool2
(that is, for many unique index builds, spool2 never even does a
tuplesort_performsort()). We may instead decide to shut down spool2
when it has no (dead) tuples. I made this work just as well for the
parallel case in this latest revision. I had to teach tuplesort.c to
accept an early tuplesort_end() for LEADER() -- it had to be prepared
to release still-waiting workers in some cases, rather than depending
on nbtsort.c having called tuplesort_performsort() already. Several
routines within nbtsort.c that previously knew something about
parallelism now know nothing about it. This seems like a nice win.

Separately, I took advantage of the fact that within the leader, its
*worker* Tuplesortstate can safely call tuplesort_end() before the
leader state's tuplesort_performsort() call.

The overall effect of these two changes is that there is now a
_bt_leader_heapscan() call for the parallel case that nicely mirrors
the serial case's IndexBuildHeapScan() call, and once we're done with
populating spools, no subsequent code needs to know a single thing
about parallelism as a special case. You may notice some small changes
to the tuplesort.h overview, which now advertises that callers can
take advantage of this leeway.
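
Schematically, the spool-filling portion of btbuild() now reads like
this (a sketch; argument lists approximate):

/* Fill spool(s) using either a parallel or a serial heap scan */
if (buildstate.btleader)
    reltuples = _bt_leader_heapscan(heap, index, &buildstate, indexInfo);
else
    reltuples = IndexBuildHeapScan(heap, index, indexInfo, true,
                                   _bt_build_callback, (void *) &buildstate);

/* From here on, nothing needs to know about parallelism */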

Now on to my responses to your most recent round of feedback...

> BufFileView() looks fairly pointless.  It basically creates a copy of
> the input and, in so doing, destroys the input, which is a lot like
> returning the input parameter except that it uses more cycles.  It
> does do a few things.

While it certainly did occur to me that that was kind of weird, and I
struggled with it on my own for a little while, I ultimately agreed
with Thomas that it added something to have ltsConcatWorkerTapes()
call some buffile function in every iteration of its loop.
(BufFileView() + BufFileViewAppend() are code that Thomas actually
wrote, though I added the asserts and comments myself.)

If you think about this in terms of the interface rather than the
implementation, then it may make more sense. The encapsulation adds
something which might pay off later, such as when extendBufFile()
needs to work with a concatenated set of BufFiles. And even right now,
I cannot simply reuse the BufFile without then losing the assert that
is currently in BufFileViewAppend() (the assert that the target has no
associated shared fileset). So I'd end up asserting less (rather than
more) there if BufFileView() was removed.

It wastes some cycles to not simply use the BufFile directly, but not
terribly many in the grand scheme of things. This happens once per
external sort operation.

> In miscadmin.h, I'd put the prototype for the new GUC next to
> max_worker_processes, not maintenance_work_mem.

But then I'd really have to put it next to max_worker_processes in
globals.c, too. That would mean that it would go under "Primary
determinants of sizes of shared-memory structures" within globals.c,
which seems wrong to me. What do you think?

> The ereport() in index_build will, I think, confuse people when it
> says that there are 0 parallel workers.  I suggest splitting this into
> two cases: if (indexInfo->ii_ParallelWorkers == 0) ereport(...
> "building index \"%s\" on table \"%s\" serially" ...) else ereport(...
> "building index \"%s\" on table \"%s\" in parallel with request for %d
> parallel workers" ...).

WFM. I've simply dropped any reference to leader participation in the
messages here, to keep things simple. This seemed okay because the
only thing that affects leader participation is the
parallel_leader_participation GUC, which is under the user's direct
control at all times, and is unlikely to be changed. Those that really
want further detail have trace_sort for that.

> The logic in IndexBuildHeapRangeScan() around need_register_snapshot
> and OldestXmin seems convoluted and not very well-edited to me.

Having revisited it, I now agree that the code added to
IndexBuildHeapRangeScan() was unclear, primarily in that the
need_unregister_snapshot local variable was overloaded in a weird way.

> I suggest that you go back to the original code organization
> and then just insert an additional case for a caller-supplied scan, so
> that the overall flow looks like this:
>
> if (scan != NULL)
> {
>    ...
> }
> else if (IsBootstrapProcessingMode() || indexInfo->ii_Concurrent)
> {
>    ...
> }
> else
> {
>    ...
> }

The problem that I see with this alternative flow is that the "if
(scan != NULL)" and the "else if (IsBootstrapProcessingMode() ||
indexInfo->ii_Concurrent)" blocks clearly must contain code for two
distinct, non-overlapping cases, despite the fact that those two
cases actually do overlap somewhat. That is, a call to
IndexBuildHeapRangeScan() may have a (parallel) heap scan argument
(control reaches your first code block), or may not (control reaches
your second or third code block). At the same time, a call to
IndexBuildHeapRangeScan() may use SnapshotAny (ordinary CREATE INDEX),
or may need an MVCC snapshot (either by registering its own, or using
the parallel one). These two things are orthogonal.

I think I still get the gist of what you're saying, though. I've come
up with a new structure that is a noticeable improvement on what I
had. Importantly, the new structure let me add a number of
parallelism-agnostic asserts that make sure that every ambuild routine
that supports parallelism gets the details right.

> Along with that, I'd change the name of need_register_snapshot to
> need_unregister_snapshot (it's doing both jobs right now) and
> initialize it to false.

Done.

> + * This support code isn't reliable when called from within a parallel
> + * worker process due to the fact that our state isn't propagated.  This is
> + * why parallel index builds are disallowed on catalogs.  It is possible
> + * that we'll fail to catch an attempted use of a user index undergoing
> + * reindexing due the non-propagation of this state to workers, which is not
> + * ideal, but the problem is not particularly likely to go undetected due to
> + * our not doing better there.
>
> I understand the first two sentences, but I have no idea what the
> third one means, especially the part that says "not particularly
> likely to go undetected due to our not doing better there".  It sounds
> scary that something bad is only "not particularly likely to go
> undetected"; don't we need to detect bad things reliably?

The primary point here, that you said you understood, is that we
definitely need to detect when we're reindexing a catalog index within
the backend, so that systable_beginscan() can do the right thing and
not use the index (we also must avoid assertion failures). My solution
to that problem is, of course, to not allow the use of parallel create
index when REINDEXing a system catalog. That seems 100% fine to me.

There is a little bit of ambiguity about other cases, though -- that's
the secondary point I tried to make within that comment block, and the
part that you took issue with. To put this secondary point another
way: It's possible that we'd fail to detect it if someone's comparator
went bananas and decided it was okay to do SQL access (that resulted
in an index scan of the index undergoing reindex). That does seem
rather unlikely, but I felt it necessary to say something like this
because ReindexIsProcessingIndex() isn't already something that only
deals with catalog indexes -- it works with all indexes.

Anyway, I reworded this. I hope that what I came up with is clearer than before.

> But also,
> you used the word "not" three times and also the prefix "un-", meaning
> "not", once.  Four negations in 13 words! Perhaps I'm not entirely in
> a position to cast aspersions on overly-complex phraseology -- the pot
> calling the kettle black and all that -- but I bet that will be a lot
> clearer if you reduce the number of negations to either 0 or 1.

You're not wrong. Simplified.

> The comment change in standard_planner() doesn't look helpful to me;
> I'd leave it out.

Okay.

> + * tableOid is the table that index is to be built on.  indexOid is the OID
> + * of a index to be created or reindexed (which must be a btree index).
>
> I'd rewrite that first sentence to end "the table on which the index
> is to be built".  The second sentence should say "an index" rather
> than "a index".

Okay.

> But, actually, I think we would be better off just ripping
> leaderWorker/leaderParticipates out of this function altogether.
> compute_parallel_worker() is not really under any illusion that it's
> computing a number of *participants*; it's just computing a number of
> *workers*.

That distinction does seem to cause plenty of confusion. While I
accept what you say about compute_parallel_worker(), I still haven't
gone as far as removing the leaderParticipates argument altogether,
because compute_parallel_worker() isn't the only thing that matters
here. (More on that below.)

> I think it's fine for
> parallel_leader_participation=off to simply mean that you get one
> fewer participants.  That's actually what would happen with parallel
> query, too.  Parallel query would consider
> parallel_leader_participation later, in get_parallel_divisor(), when
> working out the cost of one path vs. another, but it doesn't use it to
> choose the number of workers.  So it seems to me that getting rid of
> all of the workerLeader considerations will make it both simpler and
> more consistent with what we do for queries.

I was aware of those details, and figured that parallel query fudges
the compute_parallel_worker() figure's leader participation in some
sense, and that that was what I needed to compensate for. After all,
when parallel_leader_participation=off, having
compute_parallel_worker() return 1 means rather a different thing to
what it means with parallel_leader_participation=on, even though in
general we seem to assume that parallel_leader_participation can only
make a small difference overall.

Here's what I've done based on your feedback: I've changed the header
comments, but stopped leaderParticipates from affecting the
compute_parallel_worker() calculation (so, as I said,
leaderParticipates stays). The leaderParticipates argument continues
to affect these two aspects of plan_create_index_workers()'s return
value:

1. It continues to be used so that we have a total number of
participants (not workers) against which to apply our
must-have-32MB-of-workMem-per-participant limit.

Parallel query has no equivalent of this, and it seems warranted. Note
that this limit is no longer applied when parallel_workers storage
param was set, as discussed.

2. I continue to use the leaderParticipates argument to disallow the
case where there is only one CREATE INDEX participant but parallelism
is in use, because, of course, that clearly makes no sense -- we
should just use a serial sort instead.

(It might make sense to allow this if parallel_leader_participation
were *purely* a testing GUC, only for use by backend hackers, but
AFAICT it isn't.)

The planner can allow a single participant parallel sequential scan
path to be created without worrying about the fact that that doesn't
make much sense, because a plan with only one parallel participant is
always going to cost more than some serial plan (you will only get a 1
participant parallel sequential scan when force_parallel_mode is on).
Obviously plan_create_index_workers() doesn't generate (partial) paths
at all, so I simply have to get the same outcome (avoiding a senseless
1 participant parallel operation) some other way here.
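
Concretely, that guard is no more than this (sketch):

/* A "parallel" sort with a single participant makes no sense */
if (parallel_workers + (leaderParticipates ? 1 : 0) < 2)
    return 0;        /* just use a serial sort */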

> If you have an idea how to make a better
> cost model than this for CREATE INDEX, I'm willing to consider other
> options.  If you don't, or want to propose that as a follow-up patch,
> then I think it's OK to use what you've got here for starters.  I just
> don't want it to be more baroque than necessary.

I suspect that the parameters of any cost model for parallel CREATE
INDEX that we're prepared to consider for v11 are: "Use a number of
parallel workers that is one below the number at which the total
duration of the CREATE INDEX either stays the same or goes up".

It's hard to do much better than this within those parameters. I can
see a fairly noticeable benefit to parallelism with 4 parallel workers
and a measly 1MB of maintenance_work_mem (when parallelism is forced)
relative to the serial case with the same amount of memory. At least
on my laptop, it seems to be rather hard to lose relative to a serial
sort when using parallel CREATE INDEX (to be fair, I'm probably
actually using way more memory than 1MB to do this due to FS cache
usage). I can think of a cleverer approach to costing parallel CREATE
INDEX, but it's only cleverer by weighing distributed costs. Not very
relevant, for the time being.

BTW, the 32MB per participant limit within plan_create_index_workers()
was chosen based on the realization that any higher value would make
having a default setting of 2 for max_parallel_maintenance_workers (to
match the max_parallel_workers_per_gather default) pointless when the
default maintenance_work_mem value of 64MB is in use. That's not
terribly scientific, though it at least doesn't come at the expense of
a more scientific idea for a limit like that (I don't actually have
one, you see). I am merely trying to avoid being *gratuitously*
wasteful of shared resources that are difficult to accurately cost in
(e.g., the distributed cost of random I/O to the system as a whole
when we do a parallel index build while ridiculously low on
maintenance_work_mem).

> I think that the naming of the wait events could be improved.  Right
> now, they are named by which kind of process does the waiting, but it
> really should be named based on the thing for which we're
> waiting.  I also suggest that we could just write Sort instead of
> Tuplesort. In short, I suggest ParallelTuplesortLeader ->
> ParallelSortWorkersDone and ParallelTuplesortLeader ->
> ParallelSortTapeHandover.

WFM. Also added documentation for the wait events to monitoring.sgml,
which I somehow missed the first time around.

> Not for this patch, but I wonder if it might be a worthwhile future
> optimization to allow workers to return multiple tapes to the leader.
> One doesn't want to go crazy with this, of course.  If the worker
> returns 100 tapes, then the leader might get stuck doing multiple
> merge passes, which would be a foolish way to divide up the labor, and
> even if that doesn't happen, Amdahl's law argues for minimizing the
> amount of work that is not done in parallel.  Still, what if a worker
> (perhaps after merging) ends up with 2 or 3 tapes?  Is it really worth
> merging them so that the leader can do a 5-way merge instead of a
> 15-way merge?

I did think about this myself, or rather I thought specifically about
building a serial/bigserial PK during pg_restore, a case that must be
very common. The worker merges for such an index build will typically
be *completely pointless* when all input runs are in sorted order,
because the merge heap will only need to consult the root of the heap
and its two immediate children throughout (commit 24598337c helped
cases like this enormously). You might as well merge hundreds of runs
in the leader, provided you still have enough memory per tape that you
can get the full benefit of OS readahead (this is not that hard when
you're only going to repeatedly read from the same tape anyway).

I'm not too worried about it, though. The overall picture is still
very positive even in this case. The "extra worker merging" isn't
generally a big proportion of the overall cost, especially there. More
importantly, if I tried to do better, it would be the "quicksort with
spillover" cost model story all over again (remember how tedious that
was?). How hard are we prepared to work to ensure that we get it right
when it comes to skipping worker merging, given that users always pay
some overhead, even when that doesn't happen?

Note also that parallel index builds manage to unfairly *gain*
advantage over serial cases (they have the good variety of dumb luck,
rather than the bad variety) in certain other common cases.  This
happens with an *inverse* physical/logical correlation (e.g., a DESC
index build on a date field). They manage to artificially do better
than theory would predict, simply because a greater number of smaller
quicksorts are much faster during initial run generation, without also
taking a concomitant performance hit at merge time. Thomas showed this
at one point. Note that even that's only true because of the qsort
precheck (what I like to call the "banana skin prone" precheck, that
we added to our qsort implementation in 2006) -- it would be true for
*all* correlations, but that one precheck thing complicates matters.

All of this is a tricky business, and that isn't going to get any easier IMV.

> +     * Make sure that the temp file(s) underlying the tape set are created in
> +     * suitable temp tablespaces.  This is only really needed for serial
> +     * sorts.
>
> This comment makes me wonder whether it is "sorta" needed for parallel sorts.

I removed "really". The point of the comment is that we've already set
up temp tablespaces for the shared fileset in the parallel case.
Shared filesets figure out which tablespaces will be used up-front --
see SharedFileSetInit().

> -    if (trace_sort)
> +    if (trace_sort && !WORKER(state))
>
> I have a feeling we still want to get this output even from workers,
> but maybe I'm missing something.

I updated tuplesort_end() so that trace_sort reports on the end of the
sort, even for worker processes. (We still don't show generic
tuplesort_begin* message for workers, though.)

> +      arg5 indicates serial, parallel worker, or parallel leader sort.</entry>
>
> I think it should say what values are used for each case.

I based this on "arg0 indicates heap, index or datum sort", where it's
implied that the values are respective to the order in which they
appear in the sentence (starting from 0). But okay, I'll do it that
way all the same.
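
(Presumably something along the lines of "arg5 indicates serial when
0, parallel worker when 1, or parallel leader when 2" -- hypothetical
wording, but that's the shape of it.)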

> +    /* Release worker tuplesorts within leader process as soon as possible */
>
> IIUC, the worker tuplesorts aren't really holding onto much of
> anything in terms of resources.  I think it might be better to phrase
> this as /* The sort we just did absorbed the final tapes produced by
> these tuplesorts, which are of no further use. */ or words to that
> effect.

Okay. Done that way.

> Instead of making a special case in CreateParallelContext for
> serializable_okay, maybe index_build should just use SetConfigOption()
> to force the isolation level to READ COMMITTED right after it does
> NewGUCNestLevel().  The change would only be temporary because the
> subsequent call to AtEOXact_GUC() will revert it.

I tried doing it that way, but it doesn't seem workable:

postgres=# begin transaction isolation level serializable ;
BEGIN
postgres=*# reindex index test_unique;
ERROR:  25001: SET TRANSACTION ISOLATION LEVEL must be called before any query
LOCATION:  call_string_check_hook, guc.c:9953
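
The attempt amounted to something like this sketch (hypothetical code,
shown only to illustrate the shape of it):

/* Within index_build(), just after NewGUCNestLevel() */
SetConfigOption("transaction_isolation", "read committed",
                PGC_SUSET, PGC_S_OVERRIDE);

transaction_isolation's check hook rejects any change once the
transaction has run a query, hence the 25001 error above.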

Note that AutoVacLauncherMain() uses SetConfigOption() to set/modify
default_transaction_isolation -- not transaction_isolation.

Instead, I added a bit more to comments within
CreateParallelContext(), to justify what I've done along the lines you
went into. Hopefully this works better for you.

> There is *still* more to review here, but my concentration is fading.
> If you could post an updated patch after adjusting for the comments
> above, I think that would be helpful.  I'm not totally out of things
> to review that I haven't already looked over once, but I think I'm
> close.

I'm impressed with how quickly you're getting through review of the
patch. Hopefully we can keep that momentum up.

Thanks
-- 
Peter Geoghegan

Attachment

Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Prabhat Sahu
Date:
Hi all,

I have been continuing to test the parallel create index patch. So far
I haven't come across any sort of issue or regression with the
patches. Here are a few performance numbers from the latest round of
testing, which was performed on top of the 6th Jan patch submitted by Peter.

Testing is done on openstack instance with:

CPU: 8
RAM: 16GB
HD: 640 GB

postgres=# select pg_size_pretty(pg_total_relation_size('lineitem'));
 pg_size_pretty 
----------------
 93 GB
(1 row)

-- Test 1. 
max_parallel_workers_maintenance = 2
max_parallel_workers = 16 
max_parallel_workers_per_gather = 8
maintenance_work_mem = 1GB
max_wal_size = 4GB

-- Test 2. 
max_parallel_workers_maintenance = 4 
max_parallel_workers = 16 
max_parallel_workers_per_gather = 8
maintenance_work_mem = 2GB
max_wal_size = 4GB

-- Test 3. 
max_parallel_workers_maintenance = 8
max_parallel_workers = 16 
max_parallel_workers_per_gather = 8
maintenance_work_mem = 4GB
max_wal_size = 4GB

NOTE: All the time taken entries are the median of 3 consecutive runs for the same B-tree index creation query.

Time taken for parallel index creation:
(For each index: time without patch -> time with patch, followed by
the % change. Test 1/2/3 correspond to the configurations above, i.e.
max_parallel_workers_maintenance = 2/4/8 respectively.)

Index on "bigint" column:
CREATE INDEX li_ordkey_idx1 ON lineitem(l_orderkey);
  Test 1: 1062446.462 ms (17:42.446) -> 1024972.273 ms (17:04.972),  3.52 %
  Test 2: 1053468.945 ms (17:33.469) -> 896375.543 ms (14:56.376),  17.75 %
  Test 3: 1082920.703 ms (18:02.921) -> 932550.058 ms (15:32.550),  13.88 %

Index on "integer" column:
CREATE INDEX li_lineno_idx2 ON lineitem(l_linenumber);
  Test 1: 1538285.499 ms (25:38.285) -> 1201008.423 ms (20:01.008),  21.92 %
  Test 2: 1529837.023 ms (25:29.837) -> 1014188.489 ms (16:54.188),  33.70 %
  Test 3: 1642160.947 ms (27:22.161) -> 978518.253 ms (16:18.518),  40.41 %

Index on "numeric" column:
CREATE INDEX li_qty_idx3 ON lineitem(l_quantity);
  Test 1: 3968102.568 ms (01:06:08.103) -> 2359304.405 ms (39:19.304),  40.54 %
  Test 2: 4129510.930 ms (01:08:49.511) -> 1680201.644 ms (28:00.202),  59.31 %
  Test 3: 4348248.210 ms (01:12:28.248) -> 1490461.879 ms (24:50.462),  65.72 %

Index on "character" column:
CREATE INDEX li_lnst_idx4 ON lineitem(l_linestatus);
  Test 1: 1510273.931 ms (25:10.274) -> 1240265.301 ms (20:40.265),  17.87 %
  Test 2: 1516842.985 ms (25:16.843) -> 995730.092 ms (16:35.730),  34.35 %
  Test 3: 1580789.375 ms (26:20.789) -> 984975.746 ms (16:24.976),  37.69 %

Index on "date" column:
CREATE INDEX li_shipdt_idx5 ON lineitem(l_shipdate);
  Test 1: 1483603.274 ms (24:43.603) -> 1189704.930 ms (19:49.705),  19.80 %
  Test 2: 1498348.925 ms (24:58.349) -> 1040421.626 ms (17:20.422),  30.56 %
  Test 3: 1653651.499 ms (27:33.651) -> 1016305.794 ms (16:56.306),  38.54 %

Index on "character varying" column:
CREATE INDEX li_comment_idx6 ON lineitem(l_comment);
  Test 1: 6945953.838 ms (01:55:45.954) -> 4329696.334 ms (01:12:09.696),  37.66 %
  Test 2: 6818556.437 ms (01:53:38.556) -> 2834034.054 ms (47:14.034),  58.43 %
  Test 3: 6942285.711 ms (01:55:42.286) -> 2648430.902 ms (44:08.431),  61.85 %

Composite index on "numeric", "character" columns:
CREATE INDEX li_qtylnst_idx34 ON lineitem (l_quantity, l_linestatus);
  Test 1: 4961563.400 ms (01:22:41.563) -> 2959722.178 ms (49:19.722),  40.34 %
  Test 2: 5242809.501 ms (01:27:22.810) -> 2077463.136 ms (34:37.463),  60.37 %
  Test 3: 5576765.727 ms (01:32:56.766) -> 1755829.420 ms (29:15.829),  68.51 %

Composite index on "date", "character varying" columns:
CREATE INDEX li_shipdtcomment_idx56 ON lineitem (l_shipdate, l_comment);
  Test 1: 4693318.077 ms (01:18:13.318) -> 3181494.454 ms (53:01.494),  32.21 %
  Test 2: 4627624.682 ms (01:17:07.625) -> 2613289.211 ms (43:33.289),  43.52 %
  Test 3: 4719242.965 ms (01:18:39.243) -> 2685516.832 ms (44:45.517),  43.09 %


Thanks & Regards,

Prabhat Kumar Sahu
Skype ID: prabhat.sahu1984


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Amit Kapila
Date:
On Tue, Jan 16, 2018 at 6:24 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Fri, Jan 12, 2018 at 10:28 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> More comments:
>
> Attached patch has all open issues worked through, including those
> that I respond to or comment on below, as well as the other feedback
> from your previous e-mails. Note also that I fixed the issue that Amit
> raised,
>

I could still reproduce it. I think the way you have fixed it has a
race condition.  In _bt_parallel_scan_and_sort(), the value of
brokenhotchain is set after you signal the leader that the worker is
done (by incrementing workersFinished). Now, the leader is free to
decide based on the current shared state which can give the wrong
value.  Similarly, I think the value of havedead and reltuples can
also be wrong.
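
To illustrate the ordering hazard (a sketch only -- the field names
are made up, not the patch's actual code):

/* Racy: completion is signalled before the results are published */
SpinLockAcquire(&btshared->mutex);
btshared->workersFinished++;        /* leader may act on shared state now */
SpinLockRelease(&btshared->mutex);
btshared->brokenhotchain = brokenhotchain;  /* ...published too late */

/* Safe: publish all results before signalling completion */
btshared->brokenhotchain = brokenhotchain;
btshared->havedead = havedead;
btshared->reltuples = reltuples;
SpinLockAcquire(&btshared->mutex);
btshared->workersFinished++;
SpinLockRelease(&btshared->mutex);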

You neither seem to have fixed nor responded to the second problem
mentioned in my email upthread [1].  To reiterate, the problem is that
we can't assume that the workers we have launched will always start
and finish. It is possible that the postmaster fails to start a worker
due to fork failure. In such conditions, tuplesort_leader_wait will
hang indefinitely, because it will wait for the workersFinished count
to become equal to the number of launched workers (+1, if the leader
participates), which will never happen.  Am I missing something due to
which this won't be a problem?

Now, I think one argument is that such a problem can happen in a
parallel query, so it is not the responsibility of this patch to solve
it.  However, we already have a patch to solve it (though some review
comments still need to be addressed there), and this patch adds a new
code path with similar symptoms that can't be fixed by that
already-proposed patch.

[1] - https://www.postgresql.org/message-id/CAA4eK1%2BizMyxzFD6k81Deyar35YJ5qdpbRTUp9cQvo%2BniQom7Q%40mail.gmail.com


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Wed, Jan 17, 2018 at 5:47 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> I could still reproduce it. I think the way you have fixed it has a
> race condition.  In _bt_parallel_scan_and_sort(), the value of
> brokenhotchain is set after you signal the leader that the worker is
> done (by incrementing workersFinished). Now, the leader is free to
> decide based on the current shared state which can give the wrong
> value.  Similarly, I think the value of havedead and reltuples can
> also be wrong.

> You neither seem to have fixed nor responded to the second problem
> mentioned in my email upthread [1].  To reiterate, the problem is that
> we can't assume that the workers we have launched will always start
> and finish. It is possible that postmaster fails to start the worker
> due to fork failure. In such conditions, tuplesort_leader_wait will
> hang indefinitely because it will wait for the workersFinished count
> to become equal to launched workers (+1, if leader participates) which
> will never happen.  Am I missing something due to which this won't be
> a problem?

I think that both problems (the live _bt_parallel_scan_and_sort() bug,
as well as the general issue with needing to account for parallel
worker fork() failure) are likely solvable by not using
tuplesort_leader_wait(), and instead calling
WaitForParallelWorkersToFinish(). Which you suggested already.

Separately, I will need to monitor that bugfix patch, and check its
progress, to make sure that what I add is comparable to what
ultimately gets committed for parallel query.

-- 
Peter Geoghegan


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Robert Haas
Date:
On Wed, Jan 17, 2018 at 12:27 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> I think that both problems (the live _bt_parallel_scan_and_sort() bug,
> as well as the general issue with needing to account for parallel
> worker fork() failure) are likely solvable by not using
> tuplesort_leader_wait(), and instead calling
> WaitForParallelWorkersToFinish(). Which you suggested already.

I'm wondering if this shouldn't instead be handled by using the new
Barrier facilities.  I think it would work like this:

- leader calls BarrierInit(..., 0)
- leader calls BarrierAttach() before starting workers.
- each worker, before reading anything from the parallel scan, calls
BarrierAttach().  if the phase returned is greater than 0, then the
worker arrived at the barrier after all the work was done, and should
exit immediately.
- each worker, after finishing sorting, calls BarrierArriveAndWait().
leader, after sorting, also calls BarrierArriveAndWait().
- when BarrierArriveAndWait() returns in the leader, all workers that
actually started (and did so quickly enough) have arrived at the
barrier.  The leader can now do leader_takeover_tapes, being careful
to adopt only the tapes actually created, since some workers may have
failed to launch or launched only after sorting was already complete.
- meanwhile, the workers again call BarrierArriveAndWait().
- after it's done taking over tapes, the leader calls BarrierDetach(),
releasing the workers.
- the workers call BarrierDetach() and then exit -- or maybe they
don't even really need to detach

So the barrier phase numbers would have the following meanings:

0 - sorting
1 - taking over tapes
2 - done
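
Translated into a hypothetical code sketch (the Barrier calls are the
existing barrier.h API; the shared-state layout and the wait-event
name are invented for illustration):

/* Leader, before launching workers: */
BarrierInit(&shared->barrier, 0);
(void) BarrierAttach(&shared->barrier);     /* leader participates too */
LaunchParallelWorkers(pcxt);

/* Each worker, before reading anything from the parallel scan: */
if (BarrierAttach(&shared->barrier) > 0)
{
    /* arrived after sorting was over; contribute nothing and exit */
    BarrierDetach(&shared->barrier);
    return;
}
/* ... scan and sort ... */
BarrierArriveAndWait(&shared->barrier, WAIT_EVENT_PARALLEL_SORT);  /* 0 -> 1 */
BarrierArriveAndWait(&shared->barrier, WAIT_EVENT_PARALLEL_SORT);  /* 1 -> 2 */

/* Leader, after sorting its own share: */
BarrierArriveAndWait(&shared->barrier, WAIT_EVENT_PARALLEL_SORT);  /* 0 -> 1 */
/* ... leader_takeover_tapes(), adopting only tapes actually created ... */
BarrierDetach(&shared->barrier);    /* last to leave phase 1; releases workers */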

This could be slightly more elegant if BarrierArriveAndWait() had an
additional argument indicating the phase number for which the backend
could wait, or maybe the number of phases for which it should wait.
Then, the workers could avoid having to call BarrierArriveAndWait()
twice in a row.

While I find the Barrier API slightly confusing -- and I suspect I'm
not entirely alone -- I don't think that's a good excuse for
reinventing the wheel.  The problem of needing to wait for every
process that does A (in this case, read tuples from the scan) to also
do B (in this case, finish sorting those tuples) is a very general one
that is deserving of a general solution.  Unless somebody comes up
with a better plan, Barrier seems to be the way to do that in
PostgreSQL.

I don't think using WaitForParallelWorkersToFinish() is a good idea.
That would require workers to hold onto their tuplesorts until after
losing the ability to send messages to the leader, which doesn't sound
like a very good plan.  We don't want workers to detach from their
error queues until the bitter end, lest errors go unreported.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Robert Haas
Date:
On Mon, Jan 15, 2018 at 7:54 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>> BufFileView() looks fairly pointless.  It basically creates a copy of
>> the input and, in so doing, destroys the input, which is a lot like
>> returning the input parameter except that it uses more cycles.  It
>> does do a few things.
>
> While it certainly did occur to me that that was kind of weird, and I
> struggled with it on my own for a little while, I ultimately agreed
> with Thomas that it added something to have ltsConcatWorkerTapes()
> call some buffile function in every iteration of its loop.
> (BufFileView() + BufFileViewAppend() are code that Thomas actually
> wrote, though I added the asserts and comments myself.)

Hmm, well, if Thomas contributed code to this patch, then he needs to
be listed as an author.  I went searching for an email on this thread
(or any other) where he posted code for this, thinking that there
might be some discussion explaining the motivation, but I didn't find
any.  I'm still in favor of erasing this distinction.

> If you think about this in terms of the interface rather than the
> implementation, then it may make more sense. The encapsulation adds
> something which might pay off later, such as when extendBufFile()
> needs to work with a concatenated set of BufFiles. And even right now,
> I cannot simply reuse the BufFile without then losing the assert that
> is currently in BufFileViewAppend() (must not have associated shared
> fileset assert). So I'd end up asserting less (rather than more) there
> if BufFileView() was removed.

I would see the encapsulation as having some value if the original
BufFile remained valid and the new view were also valid.  Then the
BufFileView operation is a bit like a copy-on-write filesystem
snapshot: you have the original, which you can do stuff with, and you
have a copy, which can be manipulated independently, but the copying
is cheap.  But here the BufFile gobbles up the original so I don't see
the point.

The Assert(target->fileset == NULL) that would be lost in
BufFileViewAppend has no value anyway, AFAICS.  There is also
Assert(source->readOnly) given which the presence or absence of the
fileset makes no difference.  And if, as you say, extendBufFile were
eventually made to work here, this Assert would presumably get removed
anyway; I think we'd likely want the additional files to get
associated with the shared file set rather than being locally
temporary files.

> It wastes some cycles to not simply use the BufFile directly, but not
> terribly many in the grand scheme of things. This happens once per
> external sort operation.

I'm not at all concerned about the loss of cycles.  I'm concerned
about making the mechanism more complicated to understand and maintain
for future readers of the code.  When experienced hackers see code
that doesn't seem to accomplish anything, they (or at least I) tend to
assume that there must be a hidden reason for it to be there and spend
time trying to figure out what it is.  If there actually is no hidden
purpose, then that study is a waste of time and we can spare them the
trouble by getting rid of it now.

>> In miscadmin.h, I'd put the prototype for the new GUC next to
>> max_worker_processes, not maintenance_work_mem.
>
> But then I'd really have to put it next to max_worker_processes in
> globals.c, too. That would mean that it would go under "Primary
> determinants of sizes of shared-memory structures" within globals.c,
> which seems wrong to me. What do you think?

OK, that's a fair point.

> I think I still get the gist of what you're saying, though. I've come
> up with a new structure that is a noticeable improvement on what I
> had. Importantly, the new structure let me add a number of
> parallelism-agnostic asserts that make sure that every ambuild routine
> that supports parallelism gets the details right.

Yes, that looks better.  I'm slightly dubious that the new Asserts()
are worthwhile, but I guess it's OK.  But I think it would be better
to ditch the if-statement and do it like this:

Assert(snapshot == SnapshotAny || IsMVCCSnapshot(snapshot));
Assert(snapshot == SnapshotAny ? TransactionIdIsValid(OldestXmin)
    : !TransactionIdIsValid(OldestXmin));
Assert(snapshot == SnapshotAny || !anyvisible);

Also, I think you've got a little more than you need in terms of
comments.  I would keep the comments for the serial case and parallel
case and drop the earlier one that basically says the same thing:

+     * (Note that parallel case never has us register/unregister snapshot, and
+     * provides appropriate snapshot for us.)

> There is a little bit of ambiguity about other cases, though -- that's
> the secondary point I tried to make within that comment block, and the
> part that you took issue with. To put this secondary point another
> way: It's possible that we'd fail to detect it if someone's comparator
> went bananas and decided it was okay to do SQL access (that resulted
> in an index scan of the index undergoing reindex). That does seem
> rather unlikely, but I felt it necessary to say something like this
> because ReindexIsProcessingIndex() isn't already something that only
> deals with catalog indexes -- it works with all indexes.

I agree that it isn't particularly likely, but if somebody found it
worthwhile to insert guards against those cases, maybe we should
preserve them instead of abandoning them.  It shouldn't be that hard
to propagate those values from the leader to the workers.  The main
difficulty there seems to be that we're creating the parallel context
in nbtsort.c, while the state that would need to be propagated is
private to index.c, but there are several ways to solve that problem.
It looks to me like the most robust approach would be to just make
that part of what parallel.c naturally does.  Patch for that attached.

> Here's what I've done based on your feedback: I've changed the header
> comments, but stopped leaderParticipates from affecting the
> compute_parallel_worker() calculation (so, as I said,
> leaderParticipates stays). The leaderParticipates argument continues
> to affect these two aspects of plan_create_index_workers()'s return
> value:
>
> 1. It continues to be used so we have a total number of participants
> (not workers) to apply our must-have-32MB-workMem limit on
> participants.
>
> Parallel query has no equivalent of this, and it seems warranted. Note
> that this limit is no longer applied when parallel_workers storage
> param was set, as discussed.
>
> 2. I continue to use the leaderParticipates argument to disallow the
> case where there is only one CREATE INDEX participant but parallelism
> is in use, because, of course, that clearly makes no sense -- we
> should just use a serial sort instead.

That's an improvement, but see below.

> (It might make sense to allow this if parallel_leader_participation
> was *purely* a testing GUC, only for use by by backend hackers, but
> AFAICT it isn't.)

As applied to parallel CREATE INDEX, it pretty much is just a testing
GUC, which is why I was skeptical about leaving support for it in the
patch.  There's no anticipated advantage to having the leader not
participate -- unlike for parallel queries, where it is quite possible
that setting parallel_leader_participation=off could be a win, even
generally.  If you just have a Gather over a parallel sequential scan,
it is unlikely that parallel_leader_participation=off will help; it
will most likely hurt, at least up to the point where more
participants become a bad idea in general due to contention.  However,
if you have a complex plan involving fairly-large operations that
cannot be divided up among workers, such as a Parallel Append or a
Hash Join with a big startup cost or a Sort that happens in the worker
or even a parallel Index Scan that takes a long time to advance to the
next page because it has to do I/O, you might leave workers idling
while the leader is trying to "help".  Some users may have workloads
where this is the normal case.  Ideally, the planner would figure out
whether this is likely and tell the leader whether or not to
participate, but we don't know how to figure that out yet.  On the
other hand, for CREATE INDEX, having the leader not participate can't
really improve anything.

In other words, right now, parallel_leader_participation is not
strictly a testing GUC, but if we make CREATE INDEX respect it, then
we're pushing it towards being a GUC that you don't ever want to
enable except for testing.  I'm still not sure that's a very good
idea, but if we're going to do it, then surely we should be
consistent.  It's true that having one worker and no parallel leader
participation can never be better than just having the leader do it,
but it is also true that having two workers and no parallel leader
participation can never be better than having 1 worker with leader
participation.  I don't see a reason to treat those cases differently.

If we're going to keep parallel_leader_participation support here, I
think the last hunk in config.sgml should read more like this:

Allows the leader process to execute the query plan under
<literal>Gather</literal> and <literal>Gather Merge</literal> nodes
and to participate in parallel index builds.  The default is
<literal>on</literal>.  For queries, setting this value to
<literal>off</literal> reduces the likelihood that workers will become
blocked because the leader is not reading tuples fast enough, but
requires the leader process to wait for worker processes to start up
before the first tuples can be produced.  The degree to which the
leader can help or hinder performance depends on the plan type or
index build strategy, number of workers and query duration.  For index
builds, setting this value to <literal>off</literal> is expected to
reduce performance, but may be useful for testing purposes.

> I suspect that the parameters of any cost model for parallel CREATE
> INDEX that we're prepared to consider for v11 are: "Use a number of
> parallel workers that is one below the number at which the total
> duration of the CREATE INDEX either stays the same or goes up".

That's pretty much the definition of a correct cost model; the trick
is how to implement it without an oracle.

> BTW, the 32MB per participant limit within plan_create_index_workers()
> was chosen based on the realization that any higher value would make
> having a default setting of 2 for max_parallel_maintenance_workers (to
> match the max_parallel_workers_per_gather default) pointless when the
> default maintenance_work_mem value of 64MB is in use. That's not
> terribly scientific, though it at least doesn't come at the expense of
> a more scientific idea for a limit like that (I don't actually have
> one, you see). I am merely trying to avoid being *gratuitously*
> wasteful of shared resources that are difficult to accurately cost in
> (e.g., the distributed cost of random I/O to the system as a whole
> when we do a parallel index build while ridiculously low on
> maintenance_work_mem).

I see.  I think it's a good start.  I wonder in general whether it's
better to add memory or add workers.  In other words, suppose I have a
busy system where my index builds are slow.  Should I try to free up
some memory so that I can raise maintenance_work_mem, or should I try
to free up some CPU resources so I can raise
max_parallel_maintenance_workers?  The answer doubtless depends on the
current values that I have configured for those settings and the type
of data that I'm indexing, as well as how much memory I could free up
how easily and how much CPU I could free up how easily.  But I wish I
understood better than I do which one was more likely to help in a
given situation.

I also wonder what the next steps would be to make this whole thing
scale better.  From the performance tests that have been performed so
far, it seems like adding a modest number of workers definitely helps,
but it tops out around 2-3x with 4-8 workers.  I understand from your
previous comments that's typical of other databases.  It also seems
pretty clear that more memory helps but only to a point.  For
instance, I just tried "create index x on pgbench_accounts (aid)"
without your patch at scale factor 1000.  With maintenance_work_mem =
1MB, it generated 6689 runs and took 131 seconds.  With
maintenance_work_mem = 64MB, it took 67 seconds.  With
maintenance_work_mem = 1GB, it took 60 seconds.  More memory didn't
help, even if the sort could be made entirely internal. This seems to
be a fairly typical pattern: using enough memory can buy you a small
multiple, using a bunch of workers can buy you a small multiple, but
then it just doesn't get faster.  Yet, in theory, it seems like if
we're willing to provide essentially unlimited memory and CPU
resources, we ought to be able to make this go almost arbitrarily
fast.

>> I think that the naming of the wait events could be improved.  Right
>> now, they are named by which kind of process does the waiting, but it
>> really should be named based on the thing for which we're
>> waiting.  I also suggest that we could just write Sort instead of
>> Tuplesort. In short, I suggest ParallelTuplesortLeader ->
>> ParallelSortWorkersDone and ParallelTuplesortLeader ->
>> ParallelSortTapeHandover.
>
> WFM. Also added documentation for the wait events to monitoring.sgml,
> which I somehow missed the first time around.

But you forgot to update the preceding "morerows" line, so the
formatting will be all messed up.

>> +     * Make sure that the temp file(s) underlying the tape set are created in
>> +     * suitable temp tablespaces.  This is only really needed for serial
>> +     * sorts.
>>
>> This comment makes me wonder whether it is "sorta" needed for parallel sorts.
>
> I removed "really". The point of the comment is that we've already set
> up temp tablespaces for the shared fileset in the parallel case.
> Shared filesets figure out which tablespaces will be used up-front --
> see SharedFileSetInit().

So why not say it that way?  i.e. For parallel sorts, this should have
been done already, but it doesn't matter if it gets done twice.

> I updated tuplesort_end() so that trace_sort reports on the end of the
> sort, even for worker processes. (We still don't show generic
> tuplesort_begin* message for workers, though.)

I don't see any reason not to make those contingent only on
trace_sort.  The user can puzzle apart which messages are which from
the PIDs in the logfile.

>> Instead of making a special case in CreateParallelContext for
>> serializable_okay, maybe index_build should just use SetConfigOption()
>> to force the isolation level to READ COMMITTED right after it does
>> NewGUCNestLevel().  The change would only be temporary because the
>> subsequent call to AtEOXact_GUC() will revert it.
>
> I tried doing it that way, but it doesn't seem workable:
>
> postgres=# begin transaction isolation level serializable ;
> BEGIN
> postgres=*# reindex index test_unique;
> ERROR:  25001: SET TRANSACTION ISOLATION LEVEL must be called before any query
> LOCATION:  call_string_check_hook, guc.c:9953

Bummer.

> Instead, I added a bit more to comments within
> CreateParallelContext(), to justify what I've done along the lines you
> went into. Hopefully this works better for you.

Yeah, that seems OK.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Wed, Jan 17, 2018 at 10:27 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Jan 17, 2018 at 12:27 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>> I think that both problems (the live _bt_parallel_scan_and_sort() bug,
>> as well as the general issue with needing to account for parallel
>> worker fork() failure) are likely solvable by not using
>> tuplesort_leader_wait(), and instead calling
>> WaitForParallelWorkersToFinish(). Which you suggested already.
>
> I'm wondering if this shouldn't instead be handled by using the new
> Barrier facilities.

> While I find the Barrier API slightly confusing -- and I suspect I'm
> not entirely alone -- I don't think that's a good excuse for
> reinventing the wheel.  The problem of needing to wait for every
> process that does A (in this case, read tuples from the scan) to also
> do B (in this case, finish sorting those tuples) is a very general one
> that is deserving of a general solution.  Unless somebody comes up
> with a better plan, Barrier seems to be the way to do that in
> PostgreSQL.
>
> I don't think using WaitForParallelWorkersToFinish() is a good idea.
> That would require workers to hold onto their tuplesorts until after
> losing the ability to send messages to the leader, which doesn't sound
> like a very good plan.  We don't want workers to detach from their
> error queues until the bitter end, lest errors go unreported.

What you say here sounds convincing to me. I actually brought up the
idea of using the barrier abstraction a little over a month ago. I was
discouraged by a complicated sounding issue raised by Thomas [1]. At
the time, I figured that the barrier abstraction was a nice to have,
but not really essential. That idea doesn't hold up under scrutiny. I
need to be able to use barriers.

There seems to be some yak shaving involved in getting the barrier
abstraction to do exactly what is required, as Thomas went into at the
time. How should that prerequisite work be structured? For example,
should a patch be spun off for that part?

I may not be the most qualified person for this job, since Thomas
considered two alternative approaches (to making the static barrier
abstraction forget about never-launched participants) without ever
settling on one of them.

[1] https://postgr.es/m/CAEepm=03YnefpCeB=Z67HtQAOEMuhKGyPCY_S1TeH=9a2Rr0LQ@mail.gmail.com
-- 
Peter Geoghegan


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Robert Haas
Date:
On Wed, Jan 17, 2018 at 7:00 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> There seems to be some yak shaving involved in getting the barrier
> abstraction to do exactly what is required, as Thomas went into at the
> time. How should that prerequisite work be structured? For example,
> should a patch be spun off for that part?
>
> I may not be the most qualified person for this job, since Thomas
> considered two alternative approaches (to making the static barrier
> abstraction forget about never-launched participants) without ever
> settling on one of them.

I had forgotten about the previous discussion.  The sketch in my
previous email supposed that we would use dynamic barriers since the
whole point, after all, is to handle the fact that we don't know how
many participants will really show up.  Thomas's idea seems to be that
the leader will initialize the barrier based on the anticipated number
of participants and then tell it to forget about the participants that
don't materialize.  Of course, that would require that the leader
somehow figure out how many participants didn't show up so that it can
deduct them from the counter in the barrier.  And how is it going to
do that?

It's true that the leader will know the value of nworkers_launched,
but as the comment in LaunchParallelWorkers() says: "The caller must
be able to tolerate ending up with fewer workers than expected, so
there is no need to throw an error here if registration fails.  It
wouldn't help much anyway, because registering the worker in no way
guarantees that it will start up and initialize successfully."  So it
seems to me that a much better plan than having the leader try to
figure out how many workers failed to launch would be to just keep a
count of how many workers did in fact launch.  The count can be stored
in shared memory, and each worker that comes along can increment it.
Then we don't have to worry about whether we accurately detect failure
to launch.  We can argue about whether it's possible to detect all
cases of failure to launch unerringly, but what's for sure is that if
a worker increments a counter in shared memory, it launched.  Now,
where should this counter be located?  There are of course multiple
possibilities, but in my sketch it goes in
some_barrier_variable->nparticipants i.e. we just use a dynamic
barrier.
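
Spelled out as a minimal sketch (SortCoordinateShared and the function
names here are made up; BarrierInit(), BarrierAttach() and
BarrierDetach() are the real calls from storage/barrier.h):

#include "postgres.h"
#include "storage/barrier.h"

/* Hypothetical shared state, living in the DSM segment. */
typedef struct SortCoordinateShared
{
    Barrier     barrier;            /* party size starts at zero */
} SortCoordinateShared;

/* Leader, before LaunchParallelWorkers(). */
static void
leader_setup(SortCoordinateShared *shared)
{
    BarrierInit(&shared->barrier, 0);   /* dynamic: nobody pre-declared */
    BarrierAttach(&shared->barrier);    /* leader counts itself */
}

/*
 * Each worker that actually starts up.  Attaching is the "increment a
 * counter in shared memory" step: a worker that never launches is never
 * counted, so launch failure needs no special detection here.
 */
static void
worker_startup(SortCoordinateShared *shared)
{
    if (BarrierAttach(&shared->barrier) > 0)
    {
        /* Arrived after the sort moved on; bow out without waiting. */
        BarrierDetach(&shared->barrier);
        return;
    }
    /* ... read tuples from the scan and sort them ... */
}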

So my position (at least until Thomas or Andres shows up and tells me
why I'm wrong) is that you can use the Barrier API just as it is
without any yak-shaving, just by following the sketch I set out
before.  The additional API I proposed in that sketch isn't really
required, although it might be more efficient.  But it doesn't really
matter: if that comes along later, it will be trivial to adjust the
code to take advantage of it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Wed, Jan 17, 2018 at 10:40 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> While it certainly did occur to me that that was kind of weird, and I
>> struggled with it on my own for a little while, I ultimately agreed
>> with Thomas that it added something to have ltsConcatWorkerTapes()
>> call some buffile function in every iteration of its loop.
>> (BufFileView() + BufFileViewAppend() are code that Thomas actually
>> wrote, though I added the asserts and comments myself.)
>
> Hmm, well, if Thomas contributed code to this patch, then he needs to
> be listed as an author.  I went searching for an email on this thread
> (or any other) where he posted code for this, thinking that there
> might be some discussion explaining the motivation, but I didn't find
> any.  I'm still in favor of erasing this distinction.

I cleared this with Thomas recently, on this very thread, and got a +1
from him on not listing him as an author. Still, I have no problem
crediting Thomas as an author instead of a reviewer, even though
you're now asking me to remove what little code he actually authored.
The distinction between secondary author and reviewer is often
blurred, anyway.

Whether or not Thomas is formally a co-author is ambiguous, and not
something that I feel strongly about (there is no ambiguity about the
fact that he made a very useful contribution, though -- he certainly
did, both directly and indirectly). I already went out of my way to
ensure that Heikki receives a credit for parallel CREATE INDEX in the
v11 release notes, even though I don't think that there is any formal
rule requiring me to do so -- he *didn't* write even one line of code
in this patch. (That was just my take on another ambiguous question
about authorship.)

I suggest that we revisit this when you're just about to commit the
patch. Or you can just add his name -- I like to err on the side of
being inclusive.

>> If you think about this in terms of the interface rather than the
>> implementation, then it may make more sense. The encapsulation adds
>> something which might pay off later, such as when extendBufFile()
>> needs to work with a concatenated set of BufFiles. And even right now,
>> I cannot simply reuse the BufFile without then losing the assert that
>> is currently in BufFileViewAppend() (must not have associated shared
>> fileset assert). So I'd end up asserting less (rather than more) there
>> if BufFileView() was removed.
>
> I would see the encapsulation as having some value if the original
> BufFile remained valid and the new view were also valid.  Then the
> BufFileView operation is a bit like a copy-on-write filesystem
> snapshot: you have the original, which you can do stuff with, and you
> have a copy, which can be manipulated independently, but the copying
> is cheap.  But here the BufFile gobbles up the original so I don't see
> the point.

I'll see what I can do about this.

>> I think I still get the gist of what you're saying, though. I've come
>> up with a new structure that is a noticeable improvement on what I
>> had. Importantly, the new structure let me add a number of
>> parallelism-agnostic asserts that make sure that every ambuild routine
>> that supports parallelism gets the details right.
>
> Yes, that looks better.  I'm slightly dubious that the new Asserts()
> are worthwhile, but I guess it's OK.

Bear in mind that the asserts basically amount to a check that the am
propagated indexInfo->ii_Concurrent correctly within workers. It's
nice to be able to do this in a way that applies equally well to the
serial case.

> But I think it would be better
> to ditch the if-statement and do it like this:
>
> Assert(snapshot == SnapshotAny || IsMVCCSnapshot(snapshot));
> Assert(snapshot == SnapshotAny ? TransactionIdIsValid(OldestXmin)
>     : !TransactionIdIsValid(OldestXmin));
> Assert(snapshot == SnapshotAny || !anyvisible);
>
> Also, I think you've got a little more than you need in terms of
> comments.  I would keep the comments for the serial case and parallel
> case and drop the earlier one that basically says the same thing:

Okay.

>> (ReindexIsProcessingIndex() issue with non-catalog tables)
>
> I agree that it isn't particularly likely, but if somebody found it
> worthwhile to insert guards against those cases, maybe we should
> preserve them instead of abandoning them.  It shouldn't be that hard
> to propagate those values from the leader to the workers.  The main
> difficulty there seems to be that we're creating the parallel context
> in nbtsort.c, while the state that would need to be propagated is
> private to index.c, but there are several ways to solve that problem.
> It looks to me like the most robust approach would be to just make
> that part of what parallel.c naturally does.  Patch for that attached.

If you think it's worth the cycles, then I have no objection. I will
point out that this means that everything that I say about
ReindexIsProcessingIndex() no longer applies, because the relevant
state will now be propagated. It doesn't need to be mentioned at all,
and I don't even need to forbid builds on catalogs.

Should I go ahead and restore builds on catalogs, and remove those
comments, on the assumption that your patch will be committed before
mine? Obviously parallel index builds on catalogs don't matter. OTOH,
why not? Perhaps it's like the debate around HOT that took place over
10 years ago, where Tom insisted that HOT work with catalogs on
general principle.

>> (It might make sense to allow this if parallel_leader_participation
>> was *purely* a testing GUC, only for use by backend hackers, but
>> AFAICT it isn't.)
>
> As applied to parallel CREATE INDEX, it pretty much is just a testing
> GUC, which is why I was skeptical about leaving support for it in the
> patch.  There's no anticipated advantage to having the leader not
> participate -- unlike for parallel queries, where it is quite possible
> that setting parallel_leader_participation=off could be a win, even
> generally.  If you just have a Gather over a parallel sequential scan,
> it is unlikely that parallel_leader_participation=off will help; it
> will most likely hurt, at least up to the point where more
> participants become a bad idea in general due to contention.

It's unlikely to hurt much, since as you yourself said,
compute_parallel_worker() doesn't consider the leader's participation.
Actually, if we assume that compute_parallel_worker() is perfect, then
surely parallel_leader_participation=off would beat
parallel_leader_participation=on for CREATE INDEX -- it would allow us
to use the value that compute_parallel_worker() truly intended. Which
is the opposite of what you say about
parallel_leader_participation=off above.

I am only trying to understand your perspective here. I don't think
that parallel_leader_participation support is that important. I think
that parallel_leader_participation=off might be slightly useful as a
way of discouraging parallel CREATE INDEX on smaller tables, just like
it is for parallel sequential scan (though this hinges on specifically
disallowing "degenerate parallel scan" cases). More often, it will
make hardly any difference if parallel_leader_participation is on or
off.

> In other words, right now, parallel_leader_participation is not
> strictly a testing GUC, but if we make CREATE INDEX respect it, then
> we're pushing it towards being a GUC that you don't ever want to
> enable except for testing.  I'm still not sure that's a very good
> idea, but if we're going to do it, then surely we should be
> consistent.

I'm confused. I *don't* want it to be something that you can only use
for testing. I want to not hurt whatever case there is for the
parallel_leader_participation GUC being something that a DBA may tune
in production. I don't see the conflict here.

> It's true that having one worker and no parallel leader
> participation can never be better than just having the leader do it,
> but it is also true that having two leaders and no parallel leader
> participation can never be better than having 1 worker with leader
> participation.  I don't see a reason to treat those cases differently.

You must mean "having two workers and no parallel leader participation...".

The reason to treat those two cases differently is simple: One
couldn't possibly be desirable in production, and undermines the whole
idea of parallel_leader_participation being user visible by adding a
sharp edge. The other is likely to be pretty harmless, especially
because leader participation is generally pretty fudged, and our cost
model is fairly rough. The difference here isn't what is important;
avoiding doing something that we know couldn't possibly help under any
circumstances is important. I think that we should do that on general
principle.

As I said in a prior e-mail, even parallel query's use of
parallel_leader_participation is consistent with what I propose here,
practically speaking, because a partial path without leader
participation will always lose to a serial sequential scan path in
practice. The fact that the optimizer will create a partial path that
makes a useless "degenerate parallel scan" a *theoretical* possibility
is irrelevant, because the optimizer has its own way of making sure
that such a plan doesn't actually get picked. It has its way, and so I
must have my own.

> If we're going to keep parallel_leader_participation support here, I
> think the last hunk in config.sgml should read more like this:
>
> Allows the leader process to execute the query plan under
> <literal>Gather</literal> and <literal>Gather Merge</literal> nodes
> and to participate in parallel index builds.  The default is
> <literal>on</literal>.  For queries, setting this value to
> <literal>off</literal> reduces the likelihood that workers will become
> blocked because the leader is not reading tuples fast enough, but
> requires the leader process to wait for worker processes to start up
> before the first tuples can be produced.  The degree to which the
> leader can help or hinder performance depends on the plan type or
> index build strategy, number of workers and query duration.  For index
> builds, setting this value to <literal>off</literal> is expected to
> reduce performance, but may be useful for testing purposes.

Why is CREATE INDEX really that different in terms of the downside for
production DBAs? I think it's more accurate to say that it's not
expected to improve performance. What do you think?

>> I suspect that the parameters of any cost model for parallel CREATE
>> INDEX that we're prepared to consider for v11 are: "Use a number of
>> parallel workers that is one below the number at which the total
>> duration of the CREATE INDEX either stays the same or goes up".
>
> That's pretty much the definition of a correct cost model; the trick
> is how to implement it without an oracle.

Correct on its own terms, at least. What I meant to convey here is
that there is little scope to do better in v11 on distributed costs
for the system as a whole, and therefore little scope to improve the
cost model.

>> BTW, the 32MB per participant limit within plan_create_index_workers()
>> was chosen based on the realization that any higher value would make
>> having a default setting of 2 for max_parallel_maintenance_workers (to
>> match the max_parallel_workers_per_gather default) pointless when the
>> default maintenance_work_mem value of 64MB is in use.

> I see.  I think it's a good start.  I wonder in general whether it's
> better to add memory or add workers.  In other words, suppose I have a
> busy system where my index builds are slow.  Should I try to free up
> some memory so that I can raise maintenance_work_mem, or should I try
> to free up some CPU resources so I can raise
> max_parallel_maintenance_workers?

This is actually all about distributed costs, I think. Provided you
have a reasonably sympathetic index build, like say a random numeric
column index build, and the index won't be multiple gigabytes in size,
then 1MB of maintenance_work_mem still seems to win with parallelism.
This seems extremely "selfish", though -- that's going to incur a lot
of random I/O for an operation that is typically characterized by
sequential I/O. Plus, I bet you're using quite a bit more memory than
1MB, in the form of FS cache. It seems hard to lose if you don't care
about distributed costs, especially if it's a matter of using 1 or 2
parallel workers versus just doing a serial build. Granted, you go
into a 1MB of maintenance_work_mem case below where parallelism loses,
which seems to contradict my suggestion that you practically cannot
lose with parallelism. However, ISTM that you really went out of your
way to find a case that lost.

Of course, I'm not arguing that it's okay for parallel CREATE INDEX to
be selfish -- it isn't. I'm prepared to say that you shouldn't use
parallelism if you have 1MB of maintenance_work_mem, no matter how
much it seems to help (and though it might sound crazy, because it is,
it *can* help). I'm just surprised that you've not said a lot more
about distributed costs, because that's where all the potential
benefit seems to be. It happens to be an area that we have no history
of modelling in any way, which makes it hard, but that's the situation
we seem to be in.

> I also wonder what the next steps would be to make this whole thing
> scale better.  From the performance tests that have been performed so
> far, it seems like adding a modest number of workers definitely helps,
> but it tops out around 2-3x with 4-8 workers.  I understand from your
> previous comments that's typical of other databases.

Yes. This patch seems to have scalability that is very similar to the
scalability that you get with similar features in other systems. I
have not verified this through first hand experience/experiments,
because I don't have access to that stuff. But I have found numerous
reports related to more than one other system. I don't think that this
is the only goal that matters, but I do think that it's an interesting
data point.

> It also seems
> pretty clear that more memory helps but only to a point.  For
> instance, I just tried "create index x on pgbench_accounts (aid)"
> without your patch at scale factor 1000.

Again, I discourage everyone from giving too much weight to index
builds like this one. This does not involve very much sorting at all,
because everything is already in order, and the comparisons are cheap
int4 comparisons. It may already be very I/O bound before you start to
use parallelism.

> With maintenance_work_mem =
> 1MB, it generated 6689 runs and took 131 seconds.  With
> maintenance_work_mem = 64MB, it took 67 seconds.  With
> maintenance_work_mem = 1GB, it took 60 seconds.  More memory didn't
> help, even if the sort could be made entirely internal. This seems to
> be a fairly typical pattern: using enough memory can buy you a small
> multiple, using a bunch of workers can buy you a small multiple, but
> then it just doesn't get faster.

Adding memory is just as likely to hurt slightly as help slightly,
especially if you're talking about CREATE INDEX, where being able to
use a final on-the-fly merge is a big deal (you can hide the cost of
the merging by doing it when you're very I/O bound anyway). This
should be true with only modest assumptions: I assume that you're in
one pass territory here, and that you have a reasonably small merge
heap (with perhaps no more than 100 runs). This seems likely to be
true the vast majority of the time with CREATE INDEX, assuming the
system is reasonably well configured. Roughly speaking, once my
assumptions are met, the exact number of runs almost doesn't matter
(that's at least useful as a mental model).

I basically disagree with the statement "using enough memory can buy
you a small multiple", since it's only true when you started out using
an unreasonably small amount of memory. Bear in mind that every time
maintenance_work_mem is doubled, our capacity to do sorts in one pass
quadruples. Using 1MB of maintenance_work_mem just doesn't make sense
*economically*, unless, perhaps, you care about neither the duration
of the CREATE INDEX statement, nor your electricity bill. You cannot
extrapolate anything useful from an index build that uses only 1MB of
maintenance_work_mem for all kinds of reasons.
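
To put a rough number on the "quadruples" claim (back-of-the-envelope,
ignoring constants): with maintenance_work_mem M and a per-tape merge
buffer of size b, quicksorted runs come out at roughly size M, and the
final merge can keep roughly M/b runs open at once, so

    run size           ~ M
    one-pass fan-in    ~ M / b
    one-pass capacity  ~ (M / b) * M  =  M^2 / b

Double M and the amount of data you can sort in a single pass goes up
by about 4x.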

I suggest taking another look at Prabhat's results. Here are my
observations about them:

* For serial sorts, a person reading his results could be forgiven for
thinking that increasing the amount of memory for a sort makes it go
*slower*, at least by a bit.

* Sometimes that doesn't happen for serial sorts, and sometimes it
does happen for parallel sorts, but mostly it hurts serial sorts and
helps parallel sorts, since Prabhat didn't start with an unreasonably
low amount of maintenance_work_mem.

* All the indexes are built against the same table. For the serial
cases, among each index that was built, the longest build took about
6x more time than the shortest. For parallel builds, it's more like a
3x difference. The difference gets smaller when you eliminate cases
that actually have to do almost no sorting. This "3x vs. 6x"
difference matters a lot.

This suggests to me that parallel CREATE INDEX has proven itself as
something that can take a mostly CPU bound index build and make it
into a mostly I/O bound index build. It also suggests that parallel
CREATE INDEX can make better use of memory, if only because each
worker still needs to get a reasonable amount of memory for itself.
You definitely don't want multiple passes in workers, though only for
the same reasons that you don't want them in serial cases.

> Yet, in theory, it seems like if
> we're willing to provide essentially unlimited memory and CPU
> resources, we ought to be able to make this go almost arbitrarily
> fast.

The main reason that the scalability of CREATE INDEX has trouble
getting past about 3.5x in cases we've seen doesn't involve any
scalability theory: we're very much I/O bound during the merge,
because we have to actually write out the index, regardless of what
tuplesort does or doesn't do. I've seen over 4x improvements on
systems that have sufficient temp file sequential I/O bandwidth, and
reasonably sympathetic data distributions/types.

>> WFM. Also added documentation for the wait events to monitoring.sgml,
>> which I somehow missed the first time around.
>
> But you forgot to update the preceding "morerows" line, so the
> formatting will be all messed up.

Fixed.

>> I removed "really". The point of the comment is that we've already set
>> up temp tablespaces for the shared fileset in the parallel case.
>> Shared filesets figure out which tablespaces will be used up-front --
>> see SharedFileSetInit().
>
> So why not say it that way?  i.e. For parallel sorts, this should have
> been done already, but it doesn't matter if it gets done twice.

Okay.

> I don't see any reason not to make those contingent only on
> trace_sort.  The user can puzzle apart which messages are which from
> the PIDs in the logfile.

Okay. I have removed anything that restrains the verbosity of
trace_sort for the WORKER() case. I think that you were right about it
the first time, but I now think that this is going too far. I'm
letting it go, though.

-- 
Peter Geoghegan


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Wed, Jan 17, 2018 at 6:20 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I had forgotten about the previous discussion.  The sketch in my
> previous email supposed that we would use dynamic barriers since the
> whole point, after all, is to handle the fact that we don't know how
> many participants will really show up.  Thomas's idea seems to be that
> the leader will initialize the barrier based on the anticipated number
> of participants and then tell it to forget about the participants that
> don't materialize.  Of course, that would require that the leader
> somehow figure out how many participants didn't show up so that it can
> deduct them from the counter in the barrier.  And how is it going to
> do that?

I don't know; Thomas?

> It's true that the leader will know the value of nworkers_launched,
> but as the comment in LaunchParallelWorkers() says: "The caller must
> be able to tolerate ending up with fewer workers than expected, so
> there is no need to throw an error here if registration fails.  It
> wouldn't help much anyway, because registering the worker in no way
> guarantees that it will start up and initialize successfully."  So it
> seems to me that a much better plan than having the leader try to
> figure out how many workers failed to launch would be to just keep a
> count of how many workers did in fact launch.

> So my position (at least until Thomas or Andres shows up and tells me
> why I'm wrong) is that you can use the Barrier API just as it is
> without any yak-shaving, just by following the sketch I set out
> before.  The additional API I proposed in that sketch isn't really
> required, although it might be more efficient.  But it doesn't really
> matter: if that comes along later, it will be trivial to adjust the
> code to take advantage of it.

Okay. I'll work on adopting dynamic barriers in the way you described.
I just wanted to make sure that we're all on the same page about what
that looks like.

-- 
Peter Geoghegan


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Thomas Munro
Date:
Hi,

I'm mostly away from my computer this week -- sorry about that, but
here are a couple of quick answers to questions directed at me:

On Thu, Jan 18, 2018 at 4:22 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Wed, Jan 17, 2018 at 10:40 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> While it certainly did occur to me that that was kind of weird, and I
>>> struggled with it on my own for a little while, I ultimately agreed
>>> with Thomas that it added something to have ltsConcatWorkerTapes()
>>> call some buffile function in every iteration of its loop.
>>> (BufFileView() + BufFileViewAppend() are code that Thomas actually
>>> wrote, though I added the asserts and comments myself.)
>>
>> Hmm, well, if Thomas contributed code to this patch, then he needs to
>> be listed as an author.  I went searching for an email on this thread
>> (or any other) where he posted code for this, thinking that there
>> might be some discussion explaining the motivation, but I didn't find
>> any.  I'm still in favor of erasing this distinction.
>
> I cleared this with Thomas recently, on this very thread, and got a +1
> from him on not listing him as an author. Still, I have no problem
> crediting Thomas as an author instead of a reviewer, even though
> you're now asking me to remove what little code he actually authored.
> The distinction between secondary author and reviewer is often
> blurred, anyway.

The confusion comes about because I gave some small code fragments to
Rushabh for the BufFileView stuff off-list, when suggesting ideas for
how to integrate Peter's patch with some ancestor of my SharedFileSet
patch.  It was just a sketch and whether or not any traces remain in
the final commit, please credit me as a reviewer.  I need to review
more patches!  /me ducks

No objections from me if you hate the "view" idea or implementation
and think it's better to make a destructive append-BufFile-to-BufFile
operation instead.

On Thu, Jan 18, 2018 at 4:28 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Wed, Jan 17, 2018 at 6:20 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I had forgotten about the previous discussion.  The sketch in my
>> previous email supposed that we would use dynamic barriers since the
>> whole point, after all, is to handle the fact that we don't know how
>> many participants will really show up.  Thomas's idea seems to be that
>> the leader will initialize the barrier based on the anticipated number
>> of participants and then tell it to forget about the participants that
>> don't materialize.  Of course, that would require that the leader
>> somehow figure out how many participants didn't show up so that it can
>> deduct them from the counter in the barrier.  And how is it going to
>> do that?
>
> I don't know; Thomas?

The idea I mentioned would only work if nworkers_launched is never
over-reported in a scenario that doesn't error out or crash, and never
under-reported in any scenario.  Otherwise static barriers may be even
less useful than I thought.

>> It's true that the leader will know the value of nworkers_launched,
>> but as the comment in LaunchParallelWorkers() says: "The caller must
>> be able to tolerate ending up with fewer workers than expected, so
>> there is no need to throw an error here if registration fails.  It
>> wouldn't help much anyway, because registering the worker in no way
>> guarantees that it will start up and initialize successfully."  So it
>> seems to me that a much better plan than having the leader try to
>> figure out how many workers failed to launch would be to just keep a
>> count of how many workers did in fact launch.

(If nworkers_launched can be silently over-reported, then does
parallel_leader_participation = off have a bug?  If no workers really
launched and reached the main executor loop but nworkers_launched > 0,
then no one is running the plan.)

>> So my position (at least until Thomas or Andres shows up and tells me
>> why I'm wrong) is that you can use the Barrier API just as it is
>> without any yak-shaving, just by following the sketch I set out
>> before.  The additional API I proposed in that sketch isn't really
>> required, although it might be more efficient.  But it doesn't really
>> matter: if that comes along later, it will be trivial to adjust the
>> code to take advantage of it.

Yeah, the dynamic Barrier API was intended for things like this.  I
was only trying to provide a simpler-to-use alternative that I thought
might work for this particular case (but not executor nodes, which
have another source of uncertainty about party size).  It sounds like
it's not actually workable though, and the dynamic API may be the only
way.  So the patch would have to deal with explicit phases.

> Okay. I'll work on adopting dynamic barriers in the way you described.
> I just wanted to make sure that we're all on the same page about what
> that looks like.

Looking at Robert's sketch, a few thoughts: (1) it's not OK to attach
and then just exit, you'll need to detach from the barrier both in the
case where the worker exits early because the phase is too high and
the case where you attach in time to help and run to completion;
(2) maybe workers could use BarrierArriveAndDetach() at the end (the
leader needs to use BarrierArriveAndWait(), but the workers don't
really need to wait for each other before they exit, do they?); (3)
erm, maybe it's a problem that errors occurring in workers while the
leader is waiting at a barrier won't unblock the leader (we don't
detach from barriers on abort/exit) -- I'll look into this.

-- 
Thomas Munro
http://www.enterprisedb.com


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Amit Kapila
Date:
On Thu, Jan 18, 2018 at 4:19 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> Hi,
>
> I'm mostly away from my computer this week -- sorry about that, but
> here are a couple of quick answers to questions directed at me:
>
> On Thu, Jan 18, 2018 at 4:22 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>> On Wed, Jan 17, 2018 at 10:40 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
>>> It's true that the leader will know the value of nworkers_launched,
>>> but as the comment in LaunchParallelWorkers() says: "The caller must
>>> be able to tolerate ending up with fewer workers than expected, so
>>> there is no need to throw an error here if registration fails.  It
>>> wouldn't help much anyway, because registering the worker in no way
>>> guarantees that it will start up and initialize successfully."  So it
>>> seems to me that a much better plan than having the leader try to
>>> figure out how many workers failed to launch would be to just keep a
>>> count of how many workers did in fact launch.
>
> (If nworkers_launched can be silently over-reported, then does
> parallel_leader_participation = off have a bug?
>

Yes, and it is being discussed in CF entry [1].

[1] - https://commitfest.postgresql.org/16/1341/

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Amit Kapila
Date:
On Thu, Jan 18, 2018 at 8:52 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Wed, Jan 17, 2018 at 10:40 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
>>> (It might make sense to allow this if parallel_leader_participation
>>> was *purely* a testing GUC, only for use by backend hackers, but
>>> AFAICT it isn't.)
>>
>> As applied to parallel CREATE INDEX, it pretty much is just a testing
>> GUC, which is why I was skeptical about leaving support for it in the
>> patch.  There's no anticipated advantage to having the leader not
>> participate -- unlike for parallel queries, where it is quite possible
>> that setting parallel_leader_participation=off could be a win, even
>> generally.  If you just have a Gather over a parallel sequential scan,
>> it is unlikely that parallel_leader_participation=off will help; it
>> will most likely hurt, at least up to the point where more
>> participants become a bad idea in general due to contention.
>
> It's unlikely to hurt much, since as you yourself said,
> compute_parallel_worker() doesn't consider the leader's participation.
> Actually, if we assume that compute_parallel_worker() is perfect, then
> surely parallel_leader_participation=off would beat
> parallel_leader_participation=on for CREATE INDEX -- it would allow us
> to use the value that compute_parallel_worker() truly intended. Which
> is the opposite of what you say about
> parallel_leader_participation=off above.
>
> I am only trying to understand your perspective here. I don't think
> that parallel_leader_participation support is that important. I think
> that parallel_leader_participation=off might be slightly useful as a
> way of discouraging parallel CREATE INDEX on smaller tables, just like
> it is for parallel sequential scan (though this hinges on specifically
> disallowing "degenerate parallel scan" cases). More often, it will
> make hardly any difference if parallel_leader_participation is on or
> off.
>
>> In other words, right now, parallel_leader_participation is not
>> strictly a testing GUC, but if we make CREATE INDEX respect it, then
>> we're pushing it towards being a GUC that you don't ever want to
>> enable except for testing.  I'm still not sure that's a very good
>> idea, but if we're going to do it, then surely we should be
>> consistent.
>

I see your point.  OTOH, I think we should have something for testing
purposes, as that helps in catching bugs and makes it easy to write
tests that cover the worker part of the code.

>
> I'm confused. I *don't* want it to be something that you can only use
> for testing. I want to not hurt whatever case there is for the
> parallel_leader_participation GUC being something that a DBA may tune
> in production. I don't see the conflict here.
>
>> It's true that having one worker and no parallel leader
>> participation can never be better than just having the leader do it,
>> but it is also true that having two leaders and no parallel leader
>> participation can never be better than having 1 worker with leader
>> participation.  I don't see a reason to treat those cases differently.
>
> You must mean "having two workers and no parallel leader participation...".
>
> The reason to treat those two cases differently is simple: One
> couldn't possibly be desirable in production, and undermines the whole
> idea of parallel_leader_participation being user visible by adding a
> sharp edge. The other is likely to be pretty harmless, especially
> because leader participation is generally pretty fudged, and our cost
> model is fairly rough. The difference here isn't what is important;
> avoiding doing something that we know couldn't possibly help under any
> circumstances is important. I think that we should do that on general
> principle.
>
> As I said in a prior e-mail, even parallel query's use of
> parallel_leader_participation is consistent with what I propose here,
> practically speaking, because a partial path without leader
> participation will always lose to a serial sequential scan path in
> practice. The fact that the optimizer will create a partial path that
> makes a useless "degenerate parallel scan" a *theoretical* possibility
> is irrelevant, because the optimizer has its own way of making sure
> that such a plan doesn't actually get picked. It has its way, and so I
> must have my own.
>

Can you please elaborate on what part of the optimizer you are
talking about, where a partial path without leader participation will
always lose to a serial sequential scan path?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Robert Haas
Date:
On Wed, Jan 17, 2018 at 10:22 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> As I said in a prior e-mail, even parallel query's use of
> parallel_leader_participation is consistent with what I propose here,
> practically speaking, because a partial path without leader
> participation will always lose to a serial sequential scan path in
> practice.

Amit's reply to this part drew my attention to it.  I think this is
entirely false.  Consider an aggregate that doesn't support partial
aggregation, and a plan that looks like this:

Aggregate
-> Gather
  -> Parallel Seq Scan
    Filter: something fairly selective

It is quite possible for this to be superior to a non-parallel plan
even with only 1 worker and no parallel leader participation.  The
worker can evaluate the filter qual, and the leader can evaluate the
aggregate.  If the CPU costs of doing those computations are high
enough to outweigh the costs of shuffling tuples between backends, we
win.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Robert Haas
Date:
On Thu, Jan 18, 2018 at 5:49 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> I'm mostly away from my computer this week -- sorry about that,

Yeah, seriously.  Since when is it OK for hackers to ever be away from
their computers?  :-)

> The idea I mentioned would only work if nworkers_launched is never
> over-reported in a scenario that doesn't error out or crash, and never
> under-reported in any scenario.  Otherwise static barriers may be even
> less useful than I thought.

I just went back to the thread on "parallel.c oblivion of
worker-startup failures" and refreshed my memory about what's going on
over there.  What's going on over there is (1) currently,
nworkers_launched can be over-reported in a scenario that doesn't
error out or crash and (2) I'm proposing to tighten things up so that
this is no longer the case.  Amit proposed making it the
responsibility of code that uses parallel.c to cope with
nworkers_launched being larger than the number that actually launched,
and my counter-proposal was to make it reliably ERROR when they don't
all launch.

So, thinking about this, I think that my proposal to use dynamic
barriers here seems like it will work regardless of who wins that
argument.  Your proposal to use static barriers and decrement the
party size based on the number of participants which fail to start
will work if I win that argument, but will not work if Amit wins that
argument.

It seems to me in general that dynamic barriers are to be preferred in
almost every circumstance, because static barriers require a longer
chain of assumptions.  We can't assume that the number of guests we
invite to the party will match the number that actually show up, so,
in the case of a static barrier, we have to make sure to adjust the
party size if some of the guests end up having to stay home with a
sick kid or their car breaks down or if they decide to go to the
neighbor's party instead.  Absentee guests are not intrinsically a
problem, but we have to make sure that we account for them in a
completely water-tight fashion.  On the other hand, with a dynamic
barrier, we don't need to worry about the guests that don't show up;
we only need to worry about the guests that DO show up.  As they come
in the door, we count them; as they leave, we count them again.  When
the numbers are equal, the party's over.  That seems more robust.

In particular, for parallel query, there is absolutely zero guarantee
that every worker reaches every plan node.  For a parallel utility
command, things seem a little better: we can assume that workers are
started only for one particular purpose.  But even that might not be
true in the future.  For example, imagine a parallel CREATE INDEX on a
partitioned table that cascades to all children.  One can easily
imagine wanting to use the same workers for the whole operation and
spread them out across the pool of tasks much as Parallel Append does.
There's a good chance this will be faster than doing each index build
in turn with maximum parallelism.  And then the static barrier thing
goes right out the window again, because the number of participants is
determined dynamically.

I really struggle to think of any case where a static barrier is
better.  I mean, suppose we have an existing party and then decide to
hold a baking contest.  We'll use a barrier to separate the baking
phase from the judging phase.  One might think that, since the number
of participants is already decided, someone could initialize the
barrier with that number rather than making everyone attach.  But it
doesn't really work, because there's a race: while one process is
creating the barrier with participants = 10, the doctor's beeper goes
off and he leaves the party.  Now there could be some situation in
which we are absolutely certain that we know how many participants
we've got and it won't change, but I suspect that in almost every
scenario deciding to use a static barrier is going to be immediately
followed by a lot of angst about how we can be sure that the number of
participants will always be correct.

>> Okay. I'll work on adopting dynamic barriers in the way you described.
>> I just wanted to make sure that we're all on the same page about what
>> that looks like.
>
> Looking at Robert's sketch, a few thoughts: (1) it's not OK to attach
> and then just exit, you'll need to detach from the barrier both in the
> case where the worker exits early because the phase is too high and
> the case where you attach in time to help and run to completion;

In the first case, I guess this is because otherwise the other
participants will wait for us even though we're not really there any
more.  In the second case, I'm not sure why it matters whether we
detach.  If we've reached the highest possible phase number, nobody's
going to wait any more, so who cares?  (I mean, apart from tidiness.)

> (2) maybe workers could use BarrierArriveAndDetach() at the end (the
> leader needs to use BarrierArriveAndWait(), but the workers don't
> really need to wait for each other before they exit, do they?);

They don't need to wait for each other, but they do need to wait for
the leader, so I don't think this works.  Logically, there are two key
sequencing points.  First, the leader needs to wait for the workers to
finish sorting.  That's the barrier between phase 0 and phase 1.
Second, the workers need to wait for the leader to absorb their tapes.
That's the barrier between phase 1 and phase 2.  If the workers use
BarrierArriveAndWait to reach phase 1 and then BarrierArriveAndDetach,
they won't wait for the leader to be done adopting their tapes as they
do in the current patch.

But, hang on a minute.  Why do the workers need to wait for the leader
anyway?  Can't they just exit once they're done sorting?  I think the
original reason why this ended up in the patch is that we needed the
leader to assume ownership of the tapes to avoid having the tapes get
blown away when the worker exits.  But, IIUC, with sharedfileset.c,
that problem no longer exists.  The tapes are jointly owned by all of
the cooperating backends and the last one to detach from it will
remove them.  So, if the worker sorts, advertises that it's done in
shared memory, and exits, then nothing should get removed and the
leader can adopt the tapes whenever it gets around to it.

If that's correct, then we only need 2 phases, not 3.  Workers
BarrierAttach() before reading any data, exiting if the phase is not
0.  Otherwise, they then read data and sort it, then advertise the
final tape in shared memory, then BarrierArriveAndDetach().  The
leader does BarrierAttach() before launching any workers, then reads
data and sorts it if applicable, then does BarrierArriveAndWait().
When that returns, all workers are done sorting (and may or may not
have finished exiting) and the leader can take over their tapes and
everything is fine.  That's significantly simpler than my previous
outline, and also simpler than what the patch does today.
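
In code, that two-phase flow might look something like this
(BtBuildShared and the helper functions are hypothetical stand-ins for
whatever the patch actually does; the Barrier calls are the real API,
and the wait event argument is elided):

#include "postgres.h"
#include "storage/barrier.h"

typedef struct BtBuildShared
{
    Barrier     build_barrier;  /* phase 0 = sorting, phase 1 = done */
    /* ... per-worker final-tape metadata also lives here ... */
} BtBuildShared;

/* Hypothetical helpers standing in for the real patch code. */
static void read_and_sort(BtBuildShared *shared);
static void advertise_final_tape(BtBuildShared *shared);
static void adopt_worker_tapes(BtBuildShared *shared);

/* Worker entry point. */
static void
worker_main(BtBuildShared *shared)
{
    if (BarrierAttach(&shared->build_barrier) != 0)
    {
        /* Sorting is already over; detach and go home. */
        BarrierDetach(&shared->build_barrier);
        return;
    }
    read_and_sort(shared);              /* consume tuples, sort, dump runs */
    advertise_final_tape(shared);       /* publish tape in shared memory */
    /* Arrive, but don't wait for the leader or the other workers. */
    BarrierArriveAndDetach(&shared->build_barrier);
}

/*
 * Leader.  BarrierInit(&shared->build_barrier, 0) plus BarrierAttach()
 * already happened before LaunchParallelWorkers(), so the leader is a
 * participant too.
 */
static void
leader_sort_and_merge(BtBuildShared *shared)
{
    read_and_sort(shared);      /* leader-as-worker participation */
    /* Returns once every attached participant has finished sorting. */
    BarrierArriveAndWait(&shared->build_barrier, 0 /* wait event */);
    adopt_worker_tapes(shared); /* sharedfileset keeps the tapes alive */
}

A worker that attaches after the leader has already advanced the phase
sees phase 1 and bows out, which is exactly the no-show handling
described above.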

> (3)
> erm, maybe it's a problem that errors occurring in workers while the
> leader is waiting at a barrier won't unblock the leader (we don't
> detach from barriers on abort/exit) -- I'll look into this.

I think if there's an ERROR, the general parallelism machinery is
going to arrange to kill every worker, so nothing matters in that case
unless barrier waits ignore interrupts, which I'm pretty sure they
don't.  (Also: if they do, I'll hit the ceiling; that would be awful.)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Thu, Jan 18, 2018 at 6:21 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> Amit's reply to this part drew my attention to it.  I think this is
> entirely false.  Consider an aggregate that doesn't support partial
> aggregation, and a plan that looks like this:
>
> Aggregate
> -> Gather
>   -> Parallel Seq Scan
>     Filter: something fairly selective
>
> It is quite possible for this to be superior to a non-parallel plan
> even with only 1 worker and no parallel leader participation.  The
> worker can evaluate the filter qual, and the leader can evaluate the
> aggregate.  If the CPU costs of doing those computations are high
> enough to outweigh the costs of shuffling tuples between backends, we
> win.

That seems pretty far fetched. But even if it wasn't, my position
would not change. This could happen only because the planner
determined that it was the cheapest plan when
parallel_leader_participation happened to be off. But clearly a
"degenerate parallel CREATE INDEX" will never be faster than a serial
CREATE INDEX, and there is a simple way to always avoid one. So why
not do so?

I give up. I'll go ahead and make parallel_leader_participation=off
allow a degenerate parallel CREATE INDEX in the next version. I think
that it will make parallel_leader_participation less useful, with no
upside, but there doesn't seem to be much more that I can do about
that.

-- 
Peter Geoghegan


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Robert Haas
Date:
On Wed, Jan 17, 2018 at 10:22 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> If you think it's worth the cycles, then I have no objection. I will
> point out that this means that everything that I say about
> ReindexIsProcessingIndex() no longer applies, because the relevant
> state will now be propagated. It doesn't need to be mentioned at all,
> and I don't even need to forbid builds on catalogs.
>
> Should I go ahead and restore builds on catalogs, and remove those
> comments, on the assumption that your patch will be committed before
> mine? Obviously parallel index builds on catalogs don't matter. OTOH,
> why not? Perhaps it's like the debate around HOT that took place over
> 10 years ago, where Tom insisted that HOT work with catalogs on
> general principle.

Yes, I think so.  If you (or someone else) can review that patch, I'll
go ahead and commit it, and then your patch can treat it as a solved
problem.  I'm not really worried about the cycles; the amount of
effort required here is surely very small compared to all of the other
things that have to be done when starting a parallel worker.

I'm not as dogmatic about the idea that everything must support system
catalogs or it's not worth doing as Tom is, but I do think it's better
if it can be done that way with reasonable effort.  When each new
feature comes with a set of unsupported corner cases, it becomes hard
for users to understand what will and will not actually work.  Now,
really big features like parallel query or partitioning or logical
replication generally do need to exclude some things in v1 or you can
never finish the project, but in this case plugging the gap seems
quite feasible.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Robert Haas
Date:
On Thu, Jan 18, 2018 at 1:14 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> That seems pretty far fetched.

I don't think it is, and there are plenty of other examples.  All you
need is a query plan that involves significant CPU work both below the
Gather node and above the Gather node.  It's not difficult to find
plans like that; there are TPC-H queries that generate plans like
that.

> But even if it wasn't, my position
> would not change. This could happen only because the planner
> determined that it was the cheapest plan when
> parallel_leader_participation happened to be off. But clearly a
> "degenerate parallel CREATE INDEX" will never be faster than a serial
> CREATE INDEX, and there is a simple way to always avoid one. So why
> not do so?

That's an excellent argument for making parallel CREATE INDEX ignore
parallel_leader_participation entirely.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Thu, Jan 18, 2018 at 6:14 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> I see your point.  OTOH, I think we should have something for testing
> purposes, as that helps in catching bugs and makes it easy to write
> tests that cover the worker part of the code.

This is about the question of whether or not we want to allow
parallel_leader_participation to prevent or allow a parallel CREATE
INDEX that has 1 parallel worker that does all the sorting, with the
leader simply consuming its output without doing any merging (a
"degenerate paralllel CREATE INDEX"). It is perhaps only secondarily
about the question of ripping out parallel_leader_participation
entirely.

> Can you please elaborate what part of optimizer are you talking about
> where without leader participation partial path will always lose to a
> serial sequential scan path?

See my remarks to Robert just now.

-- 
Peter Geoghegan


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Thu, Jan 18, 2018 at 10:27 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Jan 18, 2018 at 1:14 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>> That seems pretty far fetched.
>
> I don't think it is, and there are plenty of other examples.  All you
> need is a query plan that involves significant CPU work both below the
> Gather node and above the Gather node.  It's not difficult to find
> plans like that; there are TPC-H queries that generate plans like
> that.

You need to have a very selective qual in the worker, that eliminates
most input (keeps the worker busy), and yet manages to keep the leader
busy rather than waiting on input from the gather.

>> But even if it wasn't, my position
>> would not change. This could happen only because the planner
>> determined that it was the cheapest plan when
>> parallel_leader_participation happened to be off. But clearly a
>> "degenerate parallel CREATE INDEX" will never be faster than a serial
>> CREATE INDEX, and there is a simple way to always avoid one. So why
>> not do so?
>
> That's an excellent argument for making parallel CREATE INDEX ignore
> parallel_leader_participation entirely.

I'm done making arguments about parallel_leader_participation. Tell me
what you want, and I'll do it.

-- 
Peter Geoghegan


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Thu, Jan 18, 2018 at 10:17 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Should I go ahead and restore builds on catalogs, and remove those
>> comments, on the assumption that your patch will be committed before
>> mine? Obviously parallel index builds on catalogs don't matter. OTOH,
>> why not? Perhaps it's like the debate around HOT that took place over
>> 10 years ago, where Tom insisted that HOT work with catalogs on
>> general principle.
>
> Yes, I think so.  If you (or someone else) can review that patch, I'll
> go ahead and commit it, and then your patch can treat it as a solved
> problem.  I'm not really worried about the cycles; the amount of
> effort required here is surely very small compared to all of the other
> things that have to be done when starting a parallel worker.

Review of your patch:

* SerializedReindexState could use some comments. At least a one liner
stating its basic purpose.

* The "System index reindexing support" comment block could do with a
passing acknowledgement of the fact that this is serialized for
parallel workers.

* Maybe the "Serialize reindex state" comment within
InitializeParallelDSM() should instead say something like "Serialize
indexes-pending-reindex state".

Other than that, looks good to me. It's a simple patch with a clear purpose.

-- 
Peter Geoghegan


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Thu, Jan 18, 2018 at 9:22 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> I just went back to the thread on "parallel.c oblivion of
> worker-startup failures" and refreshed my memory about what's going on
> over there.  What's going on over there is (1) currently,
> nworkers_launched can be over-reported in a scenario that doesn't
> error out or crash and (2) I'm proposing to tighten things up so that
> this is no longer the case.

I think that we need to be able to rely on nworkers_launched to not
over-report the number of workers launched. To be fair to Amit, I
haven't actually gone off and studied the problem myself, so it's not
fair to dismiss his point of view. It nevertheless seems to me that it
makes life an awful lot easier to be able to rely on
nworkers_launched.

> So, thinking about this, I think that my proposal to use dynamic
> barriers here seems like it will work regardless of who wins that
> argument.  Your proposal to use static barriers and decrement the
> party size based on the number of participants which fail to start
> will work if I win that argument, but will not work if Amit wins that
> argument.

Sorry, but I've changed my mind. I don't think barriers owned by
tuplesort.c will work for us (though I think we will still need a
synchronization primitive within nbtsort.c). The design that Robert
sketched for using barriers seemed fine at first. But then I realized:
what about the case where you have 2 spools?

I now understand why Thomas thought that I'd end up using static
barriers, because I now see that dynamic barriers have problems of
their own if used by tuplesort.c, even with the trick of only having
participants actually participate on the condition that they show up
before the party is over (before there are no tuples left for the
worker to consume). The idea of the leader using nworkers_launched as
the assumed-launched number of workers is pretty much baked into my
patch, because my patch makes tuplesorts composable (e.g. nbtsort.c
uses two tuplesorts when there is a unique index build/2 spools).

Do individual workers need to be prepared to back out of the main
spool's sort, but not the spool2 sort (for unique index builds), or
vice-versa? Clearly that's untenable, because they're going to need to
have both as long as they're participating in a parallel CREATE INDEX
(of a unique index) -- IndexBuildHeapScan() expects both at the same
time, but there is a race condition when launching workers with 2
spools. So does nbtsort.c need to own the barrier instead? If it does,
and if that barrier subsumes the responsibilities of tuplesort.c's
condition variables, then I don't see how that can avoid causing a
mess due to confusion about phases across tuplesorts/spools.

nbtsort.c *will* need some synchronization primitive, actually, (I'm
thinking of a condition variable), but only because of the fact that
nbtsort.c happens to want to aggregate statistics about the sort at
the end (for pg_index) -- this doesn't seem like tuplesort's problem
at all. In general, it's very natural to just call
tuplesort_leader_wait(), and have all the relevant details
encapsulated within tuplesort.c. We could make tuplesort_leader_wait()
totally optional, and just use the barrier within nbtsort.c for the
wait (more on that later).

> In particular, for parallel query, there is absolutely zero guarantee
> that every worker reaches every plan node.  For a parallel utility
> command, things seem a little better: we can assume  that workers are
> started only for one particular purpose.  But even that might not be
> true in the future.

I expect workers that are reported launched to show up eventually, or
report failure. They don't strictly have to do any work beyond just
showing up (finding no tuples, reaching tuplesort_performsort(), then
finally reaching tuplesort_end()). The spool2 issue I describe above
shows why this is. They own the state (tuplesort tuples) that they
consume, and may possibly have 2 or more tuplesorts. If they cannot do
the bare minimum of checking in with us, then we're in big trouble,
because that's indistinguishable from their having actually sorted
some tuples that the leader ultimately gets to consume, without our
knowing about it.

It wouldn't be impossible to use barriers for everything. That just
seems to be incompatible with tuplesorts being composable. Long ago,
nbtsort.c actually did the sorting, too. If that was still true, then
it would be rather a lot more like parallel hashjoin, I think. You
could then just have one barrier for one state machine (with one or
two spools). It seems clear that we should avoid teaching tuplesort.c
about nbtsort.c.

> But, hang on a minute.  Why do the workers need to wait for the leader
> anyway?  Can't they just exit once they're done sorting?  I think the
> original reason why this ended up in the patch is that we needed the
> leader to assume ownership of the tapes to avoid having the tapes get
> blown away when the worker exits.  But, IIUC, with sharedfileset.c,
> that problem no longer exists.

You're right. This is why we could make calling
tuplesort_leader_wait() optional. We only need one condition variable
in tuplesort.c. Which makes me even less inclined to make the
remaining workersFinishedCv condition variable into a barrier, since
it's not at all barrier-like. After all, workers don't care about each
other's progress, or where the leader is. The leader needs to wait
until all known-launched participants report having finished, which
seems like a typical reason to use a condition variable. That doesn't
seem phase-like at all. As for workers, they don't have phases ("done"
isn't a phase for them, because as you say, there is no need for them
to wait until the leader says they can go with the shared fileset
stuff -- that's the leader's problem alone.)
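
To make the shape of that wait concrete, here is a minimal sketch of
what tuplesort_leader_wait() amounts to. The names used below
(Sharedsort, workersFinishedCv, workersFinished, nLaunched) and the
spinlock-protected counter are illustrative only, not necessarily what
the patch will end up with:

static void
tuplesort_leader_wait(Tuplesortstate *state)
{
    Sharedsort *shared = state->shared;

    ConditionVariablePrepareToSleep(&shared->workersFinishedCv);
    for (;;)
    {
        int         nfinished;

        SpinLockAcquire(&shared->mutex);
        nfinished = shared->workersFinished;
        SpinLockRelease(&shared->mutex);

        /* nLaunched is taken from the leader's nworkers_launched */
        if (nfinished == shared->nLaunched)
            break;

        ConditionVariableSleep(&shared->workersFinishedCv, PG_WAIT_IPC);
    }
    ConditionVariableCancelSleep();
}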

I guess the fact that tuplesort_leader_wait() could be optional means
that it could be removed, which means that we could in fact throw out
the last condition variable within tuplesort.c, and fully rely on
using a barrier for everything within nbtsort.c. However,
tuplesort_leader_wait() seems kind of like something that we should
have on general principle. And, more importantly, it would be tricky
to use a barrier even for this, because we still have that baked-in
assumption that nworkers_launched is the single source of truth about
the number of participants.

-- 
Peter Geoghegan


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Robert Haas
Date:
On Thu, Jan 18, 2018 at 2:05 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> Review of your patch:
>
> * SerializedReindexState could use some comments. At least a one liner
> stating its basic purpose.

Added a comment.

> * The "System index reindexing support" comment block could do with a
> passing acknowledgement of the fact that this is serialized for
> parallel workers.

Done.

> * Maybe the "Serialize reindex state" comment within
> InitializeParallelDSM() should instead say something like "Serialize
> indexes-pending-reindex state".

That would require corresponding changes in a bunch of other places,
possibly including the function names.  I think it's better to keep
the function names shorter and the comments matching the function
names, so I did not make this change.

> Other than that, looks good to me. It's a simple patch with a clear purpose.

Committed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Fri, Jan 19, 2018 at 4:52 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Other than that, looks good to me. It's a simple patch with a clear purpose.
>
> Committed.

Cool.

Clarity on what I should do about parallel_leader_participation in the
next revision would be useful at this point. You seem to either want
me to remove it from consideration entirely, or to remove the code
that specifically disallows a "degenerate parallel CREATE INDEX". I
need a final answer on that.

-- 
Peter Geoghegan


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Robert Haas
Date:
On Fri, Jan 19, 2018 at 12:16 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Fri, Jan 19, 2018 at 4:52 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> Other than that, looks good to me. It's a simple patch with a clear purpose.
>>
>> Committed.
>
> Cool.
>
> Clarity on what I should do about parallel_leader_participation in the
> next revision would be useful at this point. You seem to either want
> me to remove it from consideration entirely, or to remove the code
> that specifically disallows a "degenerate parallel CREATE INDEX". I
> need a final answer on that.

Right.  I do think that we should do one of those things, and I lean
towards removing it entirely, but I'm not entirely sure.    Rather
than making an executive decision immediately, I'd like to wait a few
days to give others a chance to comment. I am hoping that we might get
some other opinions, especially from Thomas who implemented
parallel_leader_participation, or maybe Amit who has been reviewing
recently, or anyone else who is paying attention to this thread.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Thomas Munro
Date:
On Sat, Jan 20, 2018 at 6:32 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Jan 19, 2018 at 12:16 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>> Clarity on what I should do about parallel_leader_participation in the
>> next revision would be useful at this point. You seem to either want
>> me to remove it from consideration entirely, or to remove the code
>> that specifically disallows a "degenerate parallel CREATE INDEX". I
>> need a final answer on that.
>
> Right.  I do think that we should do one of those things, and I lean
> towards removing it entirely, but I'm not entirely sure.    Rather
> than making an executive decision immediately, I'd like to wait a few
> days to give others a chance to comment. I am hoping that we might get
> some other opinions, especially from Thomas who implemented
> parallel_leader_participation, or maybe Amit who has been reviewing
> recently, or anyone else who is paying attention to this thread.

Well, I see parallel_leader_participation as having these reasons to exist:

1.  Gather could in rare circumstances not run the plan in the leader.
This can hide bugs.  It's good to be able to force that behaviour for
testing.

2.  Plans that tie up the leader process for a long time cause the
tuple queues to block, which reduces parallelism.  I speculate that
some people might want to turn that off in production, but at the very
least it seems useful for certain kinds of performance testing to be
able to remove this complication from the picture.

3.  The planner's estimations of parallel leader contribution are
somewhat bogus, especially if the startup cost is high.  It's useful
to be able to remove that problem from the picture sometimes, at least
for testing and development work.

Parallel CREATE INDEX doesn't have any of those problems.  The only
reason I can see for it to respect parallel_leader_participation = off
is for consistency with Gather.  If someone decides to run their
cluster with that setting, then it's slightly odd if CREATE INDEX
scans and sorts with one extra process, but it doesn't seem like a big
deal.

I vote for removing the GUC from consideration for now (ie always use
the leader), and revisiting the question again later when we have more
experience or if the parallel degree logic becomes more sophisticated
in future.

-- 
Thomas Munro
http://www.enterprisedb.com


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Thu, Jan 18, 2018 at 5:53 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> I guess the fact that tuplesort_leader_wait() could be optional means
> that it could be removed, which means that we could in fact throw out
> the last condition variable within tuplesort.c, and fully rely on
> using a barrier for everything within nbtsort.c. However,
> tuplesort_leader_wait() seems kind of like something that we should
> have on general principle. And, more importantly, it would be tricky
> to use a barrier even for this, because we still have that baked-in
> assumption that nworkers_launched is the single source of truth about
> the number of participants.

On third thought, tuplesort_leader_wait() should be removed entirely,
and tuplesort.c should get entirely out of the IPC business (it should
do the bare minimum of recording/reading a little state in shared
memory, while knowing nothing about condition variables, barriers, or
anything declared in parallel.h). Thinking about dealing with 2 spools
at once clinched it for me -- calling tuplesort_leader_wait() for both
underlying Tuplesortstates was silly, especially because there is
still a need for an nbtsort.c-specific wait for workers to fill in
ambuild stats. When I said "tuplesort_leader_wait() seems kind of like
something that we should have on general principle", I was wrong.

It's normal for parallel workers to have all kinds of overlapping
responsibilities, and tuplesort_leader_wait() was doing something that
I now imagine isn't desirable to most callers. They can easily provide
something equivalent at a higher level. Besides, they'll very likely
be forced to anyway, due to some high level, caller-specific need --
which is exactly what we see within nbtsort.c.

Attached patch details:

* The patch synchronizes processes using the approach just described.
Note that this allowed me to remove several #include statements within
tuplesort.c.

* The patch uses only a single condition variable for a single wait
within nbtsort.c, for the leader. No barriers are used at all (and, as
I said, tuplesort.c doesn't use condition variables anymore). Since
things are now very simple, I can't imagine anyone still arguing for
the use of barriers.

Note that I'm still relying on nworkers_launched as the single source
of truth on the number of participants that can be expected to
eventually show up (even if they end up doing zero real work). This
should be fine, because I think that it will end up being formally
guaranteed to be reliable by the work currently underway from Robert
and Amit. But even if I'm wrong about things going that way, and it
turns out that the leader needs to decide how many putative launched
workers don't "get lost" due to fork() failure (the approach which
Amit apparently advocates), then there *still* isn't much that needs
to change.

Ultimately, the leader needs to have the exact number of workers that
participated, because that's fundamental to the tuplesort approach to
parallel sort. If necessary, the leader can just figure it out in
whatever way it likes at one central point within nbtsort.c, before
the leader calls its main spool's tuplesort_begin_index_btree() --
that can happen fairly late in the process. Actually doing that (and
not just using nworkers_launched) seems questionable to me, because it
would be almost the first thing that the leader would do after
starting parallel mode -- why not just have the parallel
infrastructure do it for us, and for everyone else?

If the new tuplesort infrastructure is used in the executor at some
future date, then the leader will still need to figure out the number
of workers that reached tuplesort_begin* some other way. This
shouldn't be surprising to anyone -- tuplesort.h is very clear on this
point.

* I revised the tuplesort.h contract to account for the developments
already described (mostly that I've removed tuplesort_leader_wait()).

* The patch makes the IPC wait event CREATE INDEX specific, since
tuplesort no longer does any waits of its own -- it's now called
ParallelCreateIndexScan. Patch also removes the second wait event
entirely (the one that we called ParallelSortTapeHandover).

* We now support index builds on catalogs.

I rebased on top of Robert's recent "REINDEX state in parallel
workers" commit, 29d58fd3. Note that there was a bug here in error
paths that caused Robert's "can't happen" error to be raised (the
PG_CATCH() block call to ResetReindexProcessing()). I fixed this in
passing, by simply removing that one "can't happen" error. Note that
ResetReindexProcessing() is only called directly within
reindex_index()/IndexCheckExclusion(). This made the idea of
preserving the check in a diminished form (#includ'ing parallel.h
within index.c, in order to check if we're a parallel worker as a
condition of raising that "can't happen" error) seem unnecessary.

* The patch does not alter anything about
parallel_leader_participation, except the alterations that Robert
requested to the docs (he requested these alterations on the
assumption that we won't end up doing nothing special with
parallel_leader_participation).

I am waiting for a final decision on what is to be done about
parallel_leader_participation, but for now I've changed nothing.

* I removed BufFileView(). I also renamed BufFileViewAppend() to
BufFileAppend().

* I performed some other minor tweaks, including some requested by
Robert in his most recent round of review.

Thanks
-- 
Peter Geoghegan

Attachment

Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Amit Kapila
Date:
On Sat, Jan 20, 2018 at 2:57 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Sat, Jan 20, 2018 at 6:32 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Fri, Jan 19, 2018 at 12:16 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>>> Clarity on what I should do about parallel_leader_participation in the
>>> next revision would be useful at this point. You seem to either want
>>> me to remove it from consideration entirely, or to remove the code
>>> that specifically disallows a "degenerate parallel CREATE INDEX". I
>>> need a final answer on that.
>>
>> Right.  I do think that we should do one of those things, and I lean
>> towards removing it entirely, but I'm not entirely sure.    Rather
>> than making an executive decision immediately, I'd like to wait a few
>> days to give others a chance to comment. I am hoping that we might get
>> some other opinions, especially from Thomas who implemented
>> parallel_leader_participation, or maybe Amit who has been reviewing
>> recently, or anyone else who is paying attention to this thread.
>
> Well, I see parallel_leader_participation as having these reasons to exist:
>
> 1.  Gather could in rare circumstances not run the plan in the leader.
> This can hide bugs.  It's good to be able to force that behaviour for
> testing.
>

Or the reverse is also possible, which means the workers won't get a
chance to run the plan, in which case we can use
parallel_leader_participation = off to test worker behavior.  As said
before, I see only that as the reason to keep
parallel_leader_participation in this patch.  If we decide to do it
that way, then I think we should remove the code that specifically
disallows a "degenerate parallel CREATE INDEX", as that seems to be
confusing.  If we go this way, then I think we should use the wording
suggested by Robert in one of his emails [1] to describe the usage of
parallel_leader_participation.

BTW, is there any other way for "parallel create index" to force that
the work is done by workers?  I am insisting on having something which
can test the code path in workers because we have found quite a few
bugs using that idea.

[1] - https://www.postgresql.org/message-id/CA%2BTgmoYN-YQU9JsGQcqFLovZ-C%2BXgp1_xhJQad%3DcunGG-_p5gg%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Fri, Jan 19, 2018 at 6:52 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Or the reverse is also possible, which means the workers won't get a
> chance to run the plan, in which case we can use
> parallel_leader_participation = off to test worker behavior.  As said
> before, I see only that as the reason to keep
> parallel_leader_participation in this patch.  If we decide to do it
> that way, then I think we should remove the code that specifically
> disallows a "degenerate parallel CREATE INDEX", as that seems to be
> confusing.  If we go this way, then I think we should use the wording
> suggested by Robert in one of his emails [1] to describe the usage of
> parallel_leader_participation.

I agree that parallel_leader_participation is only useful for testing
in the context of parallel CREATE INDEX. My concern with allowing a
"degenerate parallel CREATE INDEX" to go ahead is that
parallel_leader_participation generally isn't just intended for
testing by hackers (if it was, then I wouldn't care). But I'm now more
than willing to let this go.

> BTW, is there any other way for "parallel create index" to force that
> the work is done by workers?  I am insisting on having something which
> can test the code path in workers because we have found quite a few
> bugs using that idea.

I agree that this is essential (more so than supporting
parallel_leader_participation). You can use the parallel_workers table
storage parameter for this. When the storage param has been set, we
don't care about the amount of memory available to each worker. You
can stress-test the implementation as needed. (The storage param does
care about max_parallel_maintenance_workers, but you can set that as
high as you like.)

-- 
Peter Geoghegan


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Amit Kapila
Date:
On Sat, Jan 20, 2018 at 8:33 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Fri, Jan 19, 2018 at 6:52 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
>> BTW, is there any other way for "parallel create index" to force that
>> the work is done by workers?  I am insisting on having something which
>> can test the code path in workers because we have found quite a few
>> bugs using that idea.
>
> I agree that this is essential (more so than supporting
> parallel_leader_participation). You can use the parallel_workers table
> storage parameter for this. When the storage param has been set, we
> don't care about the amount of memory available to each worker. You
> can stress-test the implementation as needed. (The storage param does
> care about max_parallel_maintenance_workers, but you can set that as
> high as you like.)
>

Right, but I think using parallel_leader_participation, you can do it
reliably and probably write some regression tests which can complete
in a predictable time.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Fri, Jan 19, 2018 at 8:44 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Right, but I think using parallel_leader_participation, you can do it
> reliably and probably write some regression tests which can complete
> in a predictable time.

Do what reliably? Guarantee that the leader will not participate as a
worker, but that workers will be used? If so, yes, you can get that.

The only issue is that you may not be able to launch parallel workers
due to hitting a limit like max_parallel_workers, in which case you'll
get a serial index build despite everything. Nothing we can do about
that, though.

-- 
Peter Geoghegan


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Amit Kapila
Date:
On Sat, Jan 20, 2018 at 10:20 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Fri, Jan 19, 2018 at 8:44 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Right, but I think using parallel_leader_participation, you can do it
>> reliably and probably write some regression tests which can complete
>> in a predictable time.
>
> Do what reliably? Guarantee that the leader will not participate as a
> worker, but that workers will be used? If so, yes, you can get that.
>

Yes, that's what I mean.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Amit Kapila
Date:
On Sat, Jan 20, 2018 at 7:03 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Thu, Jan 18, 2018 at 5:53 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>
> Attached patch details:
>
> * The patch synchronizes processes using the approach just described.
> Note that this allowed me to remove several #include statements within
> tuplesort.c.
>
> * The patch uses only a single condition variable for a single wait
> within nbtsort.c, for the leader. No barriers are used at all (and, as
> I said, tuplesort.c doesn't use condition variables anymore). Since
> things are now very simple, I can't imagine anyone still arguing for
> the use of barriers.
>
> Note that I'm still relying on nworkers_launched as the single source
> of truth on the number of participants that can be expected to
> eventually show up (even if they end up doing zero real work). This
> should be fine, because I think that it will end up being formally
> guaranteed to be reliable by the work currently underway from Robert
> and Amit. But even if I'm wrong about things going that way, and it
> turns out that the leader needs to decide how many putative launched
> workers don't "get lost" due to fork() failure (the approach which
> Amit apparently advocates), then there *still* isn't much that needs
> to change.
>
> Ultimately, the leader needs to have the exact number of workers that
> participated, because that's fundamental to the tuplesort approach to
> parallel sort.
>

I think I can see why this patch needs that.  Is it mainly for the
work you are doing in _bt_leader_heapscan where you are waiting for
all the workers to be finished?

> If necessary, the leader can just figure it out in
> whatever way it likes at one central point within nbtsort.c, before
> the leader calls its main spool's tuplesort_begin_index_btree() --
> that can happen fairly late in the process. Actually doing that (and
> not just using nworkers_launched) seems questionable to me, because it
> would be almost the first thing that the leader would do after
> starting parallel mode -- why not just have the parallel
> infrastructure do it for us, and for everyone else?
>

I think till now we haven't had any such requirement, but if it is a
must for this patch, then I don't think it is tough to do.  We need to
write an API WaitForParallelWorkerToAttach() and then call it for each
launched worker, or maybe WaitForParallelWorkersToAttach(), which can
wait for all workers to attach and report how many have successfully
attached.  It will have the functionality of
WaitForBackgroundWorkerStartup, and additionally it needs to check
whether the worker is attached to the error queue.  We already have a
similar API (WaitForReplicationWorkerAttach) for logical replication
workers.  Note that it might have a slight impact on performance,
because with this you need to wait for the workers to start up before
doing any actual work, but I don't think it should be noticeable for
large operations, especially operations like parallel create index.
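
To illustrate, leader-side usage could look something like the below
(since WaitForParallelWorkersToAttach() doesn't exist yet, its name
and the convention of returning a count are just assumptions based on
the above):

LaunchParallelWorkers(pcxt);
if (pcxt->nworkers_launched > 0)
{
    int     nattached;

    /*
     * Hypothetical API: block until each launched worker has attached
     * to its error queue (or has definitely failed to), then report
     * how many made it.
     */
    nattached = WaitForParallelWorkersToAttach(pcxt);

    /* From here on, nattached is the reliable participant count. */
}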


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Fri, Jan 19, 2018 at 9:29 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> I think I can see why this patch needs that.  Is it mainly for the
> work you are doing in _bt_leader_heapscan where you are waiting for
> all the workers to be finished?

Yes, though it's also needed for the leader tuplesort. It needs to be
able to discover worker runs by looking for temp files named 0 through
to $NWORKERS - 1.
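
Roughly speaking, the leader-side discovery amounts to something like
the following sketch (the function name is invented, and the real
logtape.c changes involve more than this):

static void
leader_takeover_tapes(SharedFileSet *fileset, int nparticipants)
{
    int     i;

    for (i = 0; i < nparticipants; i++)
    {
        char        filename[MAXPGPATH];
        BufFile    *file;

        /* Worker i named its BufFile after its worker number */
        snprintf(filename, sizeof(filename), "%d", i);
        file = BufFileOpenShared(fileset, filename);
        /* ... unify file's blocks into the leader's tapeset ... */
    }
}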

The problem with seeing who shows up after a period of time, and
having the leader arbitrarily determine that to be the total number of
participants (while blocking further participants from joining) is
that I don't know *how long to wait*. This would probably work okay
for parallel CREATE INDEX, when the leader participates as a worker,
because you can check only when the leader is finished acting as a
worker. It stands to reason that that's enough time for worker
processes to at least show up, and be seen to show up. We can use the
duration of the leader's participation as a worker as a natural way to
decide how long to wait.

But what happens when the leader doesn't participate as a worker, for whatever
reason? Other uses for parallel tuplesort might typically have much
less leader participation as compared to parallel CREATE INDEX. In
short, ISTM that seeing who shows up is a bad strategy for parallel
tuplesort.

> I think till now we haven't had any such requirement, but if it is a
> must for this patch, then I don't think it is tough to do.  We need to
> write an API WaitForParallelWorkerToAttach() and then call it for each
> launched worker, or maybe WaitForParallelWorkersToAttach(), which can
> wait for all workers to attach and report how many have successfully
> attached.  It will have the functionality of
> WaitForBackgroundWorkerStartup, and additionally it needs to check
> whether the worker is attached to the error queue.  We already have a
> similar API (WaitForReplicationWorkerAttach) for logical replication
> workers.  Note that it might have a slight impact on performance,
> because with this you need to wait for the workers to start up before
> doing any actual work, but I don't think it should be noticeable for
> large operations, especially operations like parallel create index.

Actually, though it doesn't really look like it from the way things
are structured within nbtsort.c, I don't need to wait for workers to
start up (call the WaitForParallelWorkerToAttach() function you
sketched) before doing any real work within the leader. The leader can
participate as a worker, and only do this check afterwards. That will
work because the leader Tuplesortstate has yet to do any real work.
Nothing stops me from adding a new function to tuplesort, for the
leader, that lets the leader say: "New plan -- you should now expect
this many participants" (leader takes this reliable number from
eventual call to WaitForParallelWorkerToAttach()).
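
Something like this would do (the function name and the field it sets
are invented here, just to show the scale of the change):

/* Leader-only: revise the number of worker runs to expect */
void
tuplesort_set_nparticipants(Tuplesortstate *state, int nparticipants)
{
    Assert(state->shared != NULL);
    state->nParticipants = nparticipants;
}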

I admit that I had no idea that there is this issue with
nworkers_launched until very recently. But then, that field has
absolutely no comments.

--
Peter Geoghegan


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Amit Kapila
Date:
On Sun, Jan 21, 2018 at 1:39 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Fri, Jan 19, 2018 at 9:29 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> Actually, though it doesn't really look like it from the way things
> are structured within nbtsort.c, I don't need to wait for workers to
> start up (call the WaitForParallelWorkerToAttach() function you
> sketched) before doing any real work within the leader. The leader can
> participate as a worker, and only do this check afterwards. That will
> work because the leader Tuplesortstate has yet to do any real work.
> Nothing stops me from adding a new function to tuplesort, for the
> leader, that lets the leader say: "New plan -- you should now expect
> this many participants" (leader takes this reliable number from
> eventual call to WaitForParallelWorkerToAttach()).
>
> I admit that I had no idea that there is this issue with
> nworkers_launched until very recently. But then, that field has
> absolutely no comments.
>

It would have been better if there were some comments besides that
field, but I think it has been covered at another place in the code.
See comments in LaunchParallelWorkers().

/*
* Start workers.
*
* The caller must be able to tolerate ending up with fewer workers than
* expected, so there is no need to throw an error here if registration
* fails.  It wouldn't help much anyway, because registering the worker in
* no way guarantees that it will start up and initialize successfully.
*/

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Sat, Jan 20, 2018 at 8:38 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> It would have been better if there were some comments besides that
> field, but I think it has been covered at another place in the code.
> See comments in LaunchParallelWorkers().
>
> /*
> * Start workers.
> *
> * The caller must be able to tolerate ending up with fewer workers than
> * expected, so there is no need to throw an error here if registration
> * fails.  It wouldn't help much anyway, because registering the worker in
> * no way guarantees that it will start up and initialize successfully.
> */

Why is this okay for Gather nodes, though? nodeGather.c looks at
pcxt->nworkers_launched during initialization, and appears to at least
trust it to indicate that more than zero actually-launched workers
will also show up when "nworkers_launched > 0". This trust seems critical
when parallel_leader_participation is off, because "node->nreaders ==
0" overrides the parallel_leader_participation GUC's setting (note
that node->nreaders comes directly from pcxt->nworkers_launched). If
zero workers show up, and parallel_leader_participation is off, but
pcxt->nworkers_launched/node->nreaders is non-zero, won't the Gather
ever make forward progress?

Parallel CREATE INDEX does go a bit further. It assumes that
nworkers_launched *exactly* indicates the number of workers that
successfully underwent parallel initialization, and therefore can be
expected to show up.

Is there actually a meaningful difference between the way
nworkers_launched is depended upon in each case, though?

-- 
Peter Geoghegan


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Amit Kapila
Date:
On Mon, Jan 22, 2018 at 12:50 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Sat, Jan 20, 2018 at 8:38 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> It would have been better if there were some comments besides that
>> field, but I think it has been covered at another place in the code.
>> See comments in LaunchParallelWorkers().
>>
>> /*
>> * Start workers.
>> *
>> * The caller must be able to tolerate ending up with fewer workers than
>> * expected, so there is no need to throw an error here if registration
>> * fails.  It wouldn't help much anyway, because registering the worker in
>> * no way guarantees that it will start up and initialize successfully.
>> */
>
> Why is this okay for Gather nodes, though? nodeGather.c looks at
> pcxt->nworkers_launched during initialization, and appears to at least
> trust it to indicate that more than zero actually-launched workers
> will also show up when "nworkers_launched > 0". This trust seems critical
> when parallel_leader_participation is off, because "node->nreaders ==
> 0" overrides the parallel_leader_participation GUC's setting (note
> that node->nreaders comes directly from pcxt->nworkers_launched). If
> zero workers show up, and parallel_leader_participation is off, but
> pcxt->nworkers_launched/node->nreaders is non-zero, won't the Gather
> ever make forward progress?
>

Ideally, that situation should be detected and we should throw an
error, but that doesn't happen today.  However, it will be handled
with Robert's patch on the other thread for CF entry [1].


[1] - https://commitfest.postgresql.org/16/1341/

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Sun, Jan 21, 2018 at 6:34 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Why is this okay for Gather nodes, though? nodeGather.c looks at
>> pcxt->nworkers_launched during initialization, and appears to at least
>> trust it to indicate that more than zero actually-launched workers
>> will also show up when "nworkers_launched > 0". This trust seems critical
>> when parallel_leader_participation is off, because "node->nreaders ==
>> 0" overrides the parallel_leader_participation GUC's setting (note
>> that node->nreaders comes directly from pcxt->nworkers_launched). If
>> zero workers show up, and parallel_leader_participation is off, but
>> pcxt->nworkers_launched/node->nreaders is non-zero, won't the Gather
>> ever make forward progress?
>
> Ideally, that situation should be detected and we should throw an
> error, but that doesn't happen today.  However, it will be handled
> with Robert's patch on the other thread for CF entry [1].

I knew that, but I was confused by your sketch of the
WaitForParallelWorkerToAttach() API [1]. Specifically, your suggestion
that the problem was unique to nbtsort.c, or was at least something
that nbtsort.c had to take a special interest in. It now appears more
like a general problem with a general solution, and likely one that
won't need *any* changes to code in places like nodeGather.c (or
nbtsort.c, in the case of my patch).

I guess that you meant that parallel CREATE INDEX is the first thing
to care about the *precise* number of nworkers_launched -- that is
kind of a new thing. That doesn't seem like it makes any practical
difference to us, though. I don't see why nbtsort.c should take a
special interest in this problem, for example by calling
WaitForParallelWorkerToAttach() itself. I may have missed something,
but right now ISTM that it would be risky to make the API anything
other than what both nodeGather.c and nbtsort.c already expect (that
they'll either have nworkers_launched workers show up, or be able to
propagate an error).

[1] https://postgr.es/m/CAA4eK1KzvXTCFF8inhcEviUPxp4yWCS3rZuwjfqMttf75x2rvA@mail.gmail.com
-- 
Peter Geoghegan


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Amit Kapila
Date:
On Mon, Jan 22, 2018 at 10:36 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Sun, Jan 21, 2018 at 6:34 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> Why is this okay for Gather nodes, though? nodeGather.c looks at
>>> pcxt->nworkers_launched during initialization, and appears to at least
>>> trust it to indicate that more than zero actually-launched workers
>>> will also show up when "nworkers_launched > 0". This trust seems critical
>>> when parallel_leader_participation is off, because "node->nreaders ==
>>> 0" overrides the parallel_leader_participation GUC's setting (note
>>> that node->nreaders comes directly from pcxt->nworkers_launched). If
>>> zero workers show up, and parallel_leader_participation is off, but
>>> pcxt->nworkers_launched/node->nreaders is non-zero, won't the Gather
>>> ever make forward progress?
>>
>> Ideally, that situation should be detected and we should throw an
>> error, but that doesn't happen today.  However, it will be handled
>> with Robert's patch on the other thread for CF entry [1].
>
> I knew that, but I was confused by your sketch of the
> WaitForParallelWorkerToAttach() API [1]. Specifically, your suggestion
> that the problem was unique to nbtsort.c, or was at least something
> that nbtsort.c had to take a special interest in. It now appears more
> like a general problem with a general solution, and likely one that
> won't need *any* changes to code in places like nodeGather.c (or
> nbtsort.c, in the case of my patch).
>
> I guess that you meant that parallel CREATE INDEX is the first thing
> to care about the *precise* number of nworkers_launched -- that is
> kind of a new thing. That doesn't seem like it makes any practical
> difference to us, though. I don't see why nbtsort.c should take a
> special interest in this problem, for example by calling
> WaitForParallelWorkerToAttach() itself. I may have missed something,
> but right now ISTM that it would be risky to make the API anything
> other than what both nodeGather.c and nbtsort.c already expect (that
> they'll either have nworkers_launched workers show up, or be able to
> propagate an error).
>

The difference is that nodeGather.c doesn't have any logic like the
one you have in _bt_leader_heapscan where the patch waits for each
worker to increment nparticipantsdone.  For Gather node, we do such a
thing (wait for all workers to finish) by calling
WaitForParallelWorkersToFinish which will have the capability after
Robert's patch to detect if any worker exited abnormally (fork
failure or failed before attaching to the error queue).


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Mon, Jan 22, 2018 at 3:52 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> The difference is that nodeGather.c doesn't have any logic like the
> one you have in _bt_leader_heapscan where the patch waits for each
> worker to increment nparticipantsdone.  For Gather node, we do such a
> thing (wait for all workers to finish) by calling
> WaitForParallelWorkersToFinish which will have the capability after
> Robert's patch to detect if any worker exited abnormally (fork
> failure or failed before attaching to the error queue).

FWIW, I don't think that that's really much of a difference.

ExecParallelFinish() calls WaitForParallelWorkersToFinish(), which is
similar to how _bt_end_parallel() calls
WaitForParallelWorkersToFinish() in the patch. The
_bt_leader_heapscan() condition variable wait for workers that you
refer to is quite a bit like how gather_readnext() behaves. It
generally checks to make sure that all tuple queues are done.
gather_readnext() can wait for developments using WaitLatch(), to make
sure every tuple queue is visited, with all output reliably consumed.

This doesn't look all that similar to _bt_leader_heapscan(), I
suppose, but I think that that's only because it's normal for all
output to become available all at once for nbtsort.c workers. The
startup cost is close to or actually the same as the total cost, as it
*always* is for sort nodes.

-- 
Peter Geoghegan


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Thu, Jan 18, 2018 at 9:22 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> But, hang on a minute.  Why do the workers need to wait for the leader
> anyway?  Can't they just exit once they're done sorting?  I think the
> original reason why this ended up in the patch is that we needed the
> leader to assume ownership of the tapes to avoid having the tapes get
> blown away when the worker exits.  But, IIUC, with sharedfileset.c,
> that problem no longer exists.  The tapes are jointly owned by all of
> the cooperating backends and the last one to detach from it will
> remove them.  So, if the worker sorts, advertises that it's done in
> shared memory, and exits, then nothing should get removed and the
> leader can adopt the tapes whenever it gets around to it.

BTW, I want to point out that using the shared fileset infrastructure
is only a very small impediment to adding randomAccess support. If we
really wanted to support randomAccess for the leader's tapeset, while
recycling blocks from worker BufFiles, it looks like all we'd have to
do is change PathNameOpenTemporaryFile() to open files O_RDWR, rather
than O_RDONLY (shared fileset BufFiles that are opened after export
always have O_RDONLY segments -- we'd also have to change some
assertions, as well as some comments). Overall, this approach looks
straightforward, and isn't something that I can find an issue with
after an hour or so of manual testing.

Now, I'm not actually suggesting we go that way. As you know,
randomAccess isn't used by CREATE INDEX, and randomAccess may never be
needed for any parallel sort operation. More importantly, Thomas made
PathNameOpenTemporaryFile() use O_RDONLY for a reason, and I don't
want to trade one special case (randomAccess disallowed for parallel
tuplesort leader tapeset) in exchange for another one (the logtape.c
calls to BufFileOpenShared() ask for read-write BufFiles, not
read-only BufFiles).

I'm pointing this out because this is something that should increase
confidence in the changes I've proposed to logtape.c. The fact that
randomAccess support *would* be straightforward is a sign that I
haven't accidentally introduced some other assumption, or special
case.

-- 
Peter Geoghegan


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Amit Kapila
Date:
On Tue, Jan 23, 2018 at 1:45 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Mon, Jan 22, 2018 at 3:52 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> The difference is that nodeGather.c doesn't have any logic like the
>> one you have in _bt_leader_heapscan where the patch waits for each
>> worker to increment nparticipantsdone.  For Gather node, we do such a
>> thing (wait for all workers to finish) by calling
>> WaitForParallelWorkersToFinish which will have the capability after
>> Robert's patch to detect if any worker exited abnormally (fork
>> failure or failed before attaching to the error queue).
>
> FWIW, I don't think that that's really much of a difference.
>
> ExecParallelFinish() calls WaitForParallelWorkersToFinish(), which is
> similar to how _bt_end_parallel() calls
> WaitForParallelWorkersToFinish() in the patch. The
> _bt_leader_heapscan() condition variable wait for workers that you
> refer to is quite a bit like how gather_readnext() behaves. It
> generally checks to make sure that all tuple queues are done.
> gather_readnext() can wait for developments using WaitLatch(), to make
> sure every tuple queue is visited, with all output reliably consumed.
>

The difference lies in the fact that in gather_readnext, we use the
tuple queue mechanism, which has the capability to detect that the
workers have stopped/exited, whereas _bt_leader_heapscan doesn't have
any such capability, so I think it will loop forever.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Mon, Jan 22, 2018 at 6:45 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> FWIW, I don't think that that's really much of a difference.
>>
>> ExecParallelFinish() calls WaitForParallelWorkersToFinish(), which is
>> similar to how _bt_end_parallel() calls
>> WaitForParallelWorkersToFinish() in the patch. The
>> _bt_leader_heapscan() condition variable wait for workers that you
>> refer to is quite a bit like how gather_readnext() behaves. It
>> generally checks to make sure that all tuple queues are done.
>> gather_readnext() can wait for developments using WaitLatch(), to make
>> sure every tuple queue is visited, with all output reliably consumed.
>>
>
> The difference lies in the fact that in gather_readnext, we use the
> tuple queue mechanism, which has the capability to detect that the
> workers have stopped/exited, whereas _bt_leader_heapscan doesn't have
> any such capability, so I think it will loop forever.

_bt_leader_heapscan() can detect when workers exit early, at least in
the vast majority of cases. It can do this simply by processing
interrupts and automatically propagating any error -- nothing special
about that. It can also detect when workers have finished
successfully, because of course, that's the main reason for its
existence. What remains, exactly?

I don't know that much about tuple queues, but from a quick read I
guess you might be talking about shm_mq_receive() +
shm_mq_wait_internal(). It's not obvious that that will work in all
cases ("Note that if handle == NULL, and the process fails to attach,
we'll potentially get stuck here forever"). Also, I don't see how this
addresses the parallel_leader_participation issue I raised.

-- 
Peter Geoghegan


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Amit Kapila
Date:
On Tue, Jan 23, 2018 at 8:43 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Mon, Jan 22, 2018 at 6:45 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> FWIW, I don't think that that's really much of a difference.
>>>
>>> ExecParallelFinish() calls WaitForParallelWorkersToFinish(), which is
>>> similar to how _bt_end_parallel() calls
>>> WaitForParallelWorkersToFinish() in the patch. The
>>> _bt_leader_heapscan() condition variable wait for workers that you
>>> refer to is quite a bit like how gather_readnext() behaves. It
>>> generally checks to make sure that all tuple queues are done.
>>> gather_readnext() can wait for developments using WaitLatch(), to make
>>> sure every tuple queue is visited, with all output reliably consumed.
>>>
>>
>> The difference lies in the fact that in gather_readnext, we use the
>> tuple queue mechanism, which has the capability to detect that the
>> workers have stopped/exited, whereas _bt_leader_heapscan doesn't have
>> any such capability, so I think it will loop forever.
>
> _bt_leader_heapscan() can detect when workers exit early, at least in
> the vast majority of cases. It can do this simply by processing
> interrupts and automatically propagating any error -- nothing special
> about that. It can also detect when workers have finished
> successfully, because of course, that's the main reason for its
> existence. What remains, exactly?
>

Will it be able to detect fork failure, or a worker exiting before
attaching to the error queue?  I think you can try it by forcing fork
failure in do_start_bgworker and seeing the behavior of
_bt_leader_heapscan.  I would have tried it and let you know the
results, but the latest patch doesn't seem to apply cleanly.

> I don't know that much about tuple queues, but from a quick read I
> guess you might be talking about shm_mq_receive() +
> shm_mq_wait_internal(). It's not obvious that that will work in all
> cases ("Note that if handle == NULL, and the process fails to attach,
> we'll potentially get stuck here forever"). Also, I don't see how this
> addresses the parallel_leader_participation issue I raised.
>

I am talking about shm_mq_receive->shm_mq_counterparty_gone.  In
shm_mq_counterparty_gone, it can detect if the worker is gone by using
GetBackgroundWorkerPid.
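
The core of that check is roughly the following (simplified from
shm_mq.c):

static bool
counterparty_gone(BackgroundWorkerHandle *handle)
{
    if (handle != NULL)
    {
        pid_t           pid;
        BgwHandleStatus status = GetBackgroundWorkerPid(handle, &pid);

        /* Stopped, or postmaster died: the worker will never attach now */
        if (status != BGWH_STARTED && status != BGWH_NOT_YET_STARTED)
            return true;
    }
    return false;
}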


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Robert Haas
Date:
On Mon, Jan 22, 2018 at 10:13 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> _bt_leader_heapscan() can detect when workers exit early, at least in
> the vast majority of cases. It can do this simply by processing
> interrupts and automatically propagating any error -- nothing special
> about that. It can also detect when workers have finished
> successfully, because of course, that's the main reason for its
> existence. What remains, exactly?

As Amit says, what remains is the case where fork() fails or the
worker dies before it reaches the line in ParallelWorkerMain that
reads shm_mq_set_sender(mq, MyProc).  In those cases, no error will be
signaled until you call WaitForParallelWorkersToFinish().  If you wait
prior to that point for a number of workers equal to
nworkers_launched, you will wait forever in those cases.

I am going to repeat my previous suggest that we use a Barrier here.
Given the discussion subsequent to my original proposal, this can be a
lot simpler than what I suggested originally.  Each worker does
BarrierAttach() before beginning to read tuples (exiting if the phase
returned is non-zero) and BarrierArriveAndDetach() when it's done
sorting.  The leader does BarrierAttach() before launching workers and
BarrierArriveAndWait() when it's done sorting.  If we don't do this,
we're going to have to invent some other mechanism to count the
participants that actually initialize successfully, but that seems
like it's just duplicating code.
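
In sketch form, using the existing barrier.c primitives (build_barrier
and pcxt are stand-ins for whatever the patch actually ends up with):

/* Worker, before reading any tuples: */
if (BarrierAttach(build_barrier) > 0)
{
    /* Sorting is already over; detach again and bail out. */
    BarrierDetach(build_barrier);
    proc_exit(0);
}
/* ... scan and sort ... */
BarrierArriveAndDetach(build_barrier);

/* Leader: */
BarrierAttach(build_barrier);   /* attach before launching workers */
LaunchParallelWorkers(pcxt);
/* ... leader sorts its own participant share ... */
BarrierArriveAndWait(build_barrier, 0);     /* wait event info omitted */
/* now in phase 1: every attached participant has finished sorting */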

This proposal has some minor advantages even when no fork() failure or
similar occurs.  If, for example, one or more workers take a long time
to start, the leader doesn't have to wait for them before writing out
the index.  As soon as all the workers that attached to the Barrier
have arrived at the end of phase 0, the leader can build a new tape
set from all of the tapes that exist at that time.  It does not need
to wait for the remaining workers to start up and create empty tapes.
This is only a minor advantage since we probably shouldn't be doing
CREATE INDEX in parallel in the first place if the index build is so
short that this scenario is likely to occur, but we get it basically
for free, so why not?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Tue, Jan 23, 2018 at 10:36 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> As Amit says, what remains is the case where fork() fails or the
> worker dies before it reaches the line in ParallelWorkerMain that
> reads shm_mq_set_sender(mq, MyProc).  In those cases, no error will be
> signaled until you call WaitForParallelWorkersToFinish().  If you wait
> prior to that point for a number of workers equal to
> nworkers_launched, you will wait forever in those cases.

Another option might be to actually call
WaitForParallelWorkersToFinish() in place of a condition variable or
barrier, as Amit suggested at one point.

> I am going to repeat my previous suggest that we use a Barrier here.
> Given the discussion subsequent to my original proposal, this can be a
> lot simpler than what I suggested originally.  Each worker does
> BarrierAttach() before beginning to read tuples (exiting if the phase
> returned is non-zero) and BarrierArriveAndDetach() when it's done
> sorting.  The leader does BarrierAttach() before launching workers and
> BarrierArriveAndWait() when it's done sorting.  If we don't do this,
> we're going to have to invent some other mechanism to count the
> participants that actually initialize successfully, but that seems
> like it's just duplicating code.

I think that this closes the door to leader non-participation as
anything other than a developer-only debug option, which might be
fine. If parallel_leader_participation=off (or some way of getting the
same behavior through a #define) is to be retained, then an artificial
wait is required as a substitute for the leader's participation as a
worker.

-- 
Peter Geoghegan


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Tue, Jan 23, 2018 at 10:50 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Tue, Jan 23, 2018 at 10:36 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I am going to repeat my previous suggest that we use a Barrier here.
>> Given the discussion subsequent to my original proposal, this can be a
>> lot simpler than what I suggested originally.  Each worker does
>> BarrierAttach() before beginning to read tuples (exiting if the phase
>> returned is non-zero) and BarrierArriveAndDetach() when it's done
>> sorting.  The leader does BarrierAttach() before launching workers and
>> BarrierArriveAndWait() when it's done sorting.  If we don't do this,
>> we're going to have to invent some other mechanism to count the
>> participants that actually initialize successfully, but that seems
>> like it's just duplicating code.
>
> I think that this closes the door to leader non-participation as
> anything other than a developer-only debug option, which might be
> fine. If parallel_leader_participation=off (or some way of getting the
> same behavior through a #define) is to be retained, then an artificial
> wait is required as a substitute for the leader's participation as a
> worker.

This idea of an artificial wait seems pretty grotty to me. If we made
it one second, would that be okay with Valgrind builds? And when it
wasn't sufficient, wouldn't we be back to waiting forever?

Finally, it's still not clear to me why nodeGather.c's use of
parallel_leader_participation=off doesn't suffer from similar problems
[1].

[1] https://postgr.es/m/CAH2-Wz=cAMX5btE1s=aTz7CLwzpEPm_NsUhAMAo5t5=1i9VcwQ@mail.gmail.com
-- 
Peter Geoghegan


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Robert Haas
Date:
On Tue, Jan 23, 2018 at 2:11 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> Finally, it's still not clear to me why nodeGather.c's use of
> parallel_leader_participation=off doesn't suffer from similar problems
> [1].

Thomas and I just concluded that it does.  See my email on the other
thread just now.

I thought that I had the failure cases all nailed down here now, but I
guess not.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Thomas Munro
Date:
On Fri, Jan 19, 2018 at 6:22 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> (3)
>> erm, maybe it's a problem that errors occurring in workers while the
>> leader is waiting at a barrier won't unblock the leader (we don't
>> detach from barriers on abort/exit) -- I'll look into this.
>
> I think if there's an ERROR, the general parallelism machinery is
> going to arrange to kill every worker, so nothing matters in that case
> unless barrier waits ignore interrupts, which I'm pretty sure they
> don't.  (Also: if they do, I'll hit the ceiling; that would be awful.)

(After talking this through with Robert off-list).  Right, the
CHECK_FOR_INTERRUPTS() in ConditionVariableSleep() handles errors from
parallel workers.  There is no problem here.

-- 
Thomas Munro
http://www.enterprisedb.com


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Amit Kapila
Date:
On Wed, Jan 24, 2018 at 12:20 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Tue, Jan 23, 2018 at 10:36 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> As Amit says, what remains is the case where fork() fails or the
>> worker dies before it reaches the line in ParallelWorkerMain that
>> reads shm_mq_set_sender(mq, MyProc).  In those cases, no error will be
>> signaled until you call WaitForParallelWorkersToFinish().  If you wait
>> prior to that point for a number of workers equal to
>> nworkers_launched, you will wait forever in those cases.
>
> Another option might be to actually call
> WaitForParallelWorkersToFinish() in place of a condition variable or
> barrier, as Amit suggested at one point.
>

Yes, the only thing that is slightly worrying about using
WaitForParallelWorkersToFinish is that the leader backend needs to
wait for the workers to finish, rather than just for them to finish
the sort-related work.  I think there shouldn't be much difference
between when the sort is done and when the workers actually finish the
remaining resource cleanup.  However, OTOH, if we are not okay with
this solution and want to go with some kind of usage of barriers to
solve this problem, then we can evaluate that as well; but I feel it
is better if we can use the method that is used in other parallelism
code to solve this problem (which is to use
WaitForParallelWorkersToFinish).

>> I am going to repeat my previous suggest that we use a Barrier here.
>> Given the discussion subsequent to my original proposal, this can be a
>> lot simpler than what I suggested originally.  Each worker does
>> BarrierAttach() before beginning to read tuples (exiting if the phase
>> returned is non-zero) and BarrierArriveAndDetach() when it's done
>> sorting.  The leader does BarrierAttach() before launching workers and
>> BarrierArriveAndWait() when it's done sorting.
>>

How does the leader detect if one of the workers does BarrierAttach
and then fails (either exits or errors out) before doing
BarrierArriveAndDetach?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Thomas Munro
Date:
On Wed, Jan 24, 2018 at 5:59 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> I am going to repeat my previous suggest that we use a Barrier here.
>>> Given the discussion subsequent to my original proposal, this can be a
>>> lot simpler than what I suggested originally.  Each worker does
>>> BarrierAttach() before beginning to read tuples (exiting if the phase
>>> returned is non-zero) and BarrierArriveAndDetach() when it's done
>>> sorting.  The leader does BarrierAttach() before launching workers and
>>> BarrierArriveAndWait() when it's done sorting.
>
> How does the leader detect if one of the workers does BarrierAttach
> and then fails (either exits or errors out) before doing
> BarrierArriveAndDetach?

If you attach and then exit cleanly, that's a programming error and
would cause anyone who runs BarrierArriveAndWait() to hang forever.
If you attach and raise an error, the leader will receive an error
message via CFI() and will then raise an error itself and terminate
all workers during cleanup.

-- 
Thomas Munro
http://www.enterprisedb.com


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Amit Kapila
Date:
On Wed, Jan 24, 2018 at 10:36 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Wed, Jan 24, 2018 at 5:59 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>>> I am going to repeat my previous suggest that we use a Barrier here.
>>>> Given the discussion subsequent to my original proposal, this can be a
>>>> lot simpler than what I suggested originally.  Each worker does
>>>> BarrierAttach() before beginning to read tuples (exiting if the phase
>>>> returned is non-zero) and BarrierArriveAndDetach() when it's done
>>>> sorting.  The leader does BarrierAttach() before launching workers and
>>>> BarrierArriveAndWait() when it's done sorting.
>>
>> How does the leader detect if one of the workers does BarrierAttach()
>> and then fails (either exits or errors out) before doing
>> BarrierArriveAndDetach()?
>
> If you attach and then exit cleanly, that's a programming error and
> would cause anyone who runs BarrierArriveAndWait() to hang forever.
>

Right, but what if the worker dies due to something like proc_exit(1)
before calling BarrierArriveAndWait?  I think this is part of the
problem we have solved in WaitForParallelWorkersToFinish: if the
worker exits abruptly at any point for some reason, the system should
not hang.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Thomas Munro
Date:
On Wed, Jan 24, 2018 at 6:43 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Wed, Jan 24, 2018 at 10:36 AM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> On Wed, Jan 24, 2018 at 5:59 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>>>> I am going to repeat my previous suggestion that we use a Barrier here.
>>>>> Given the discussion subsequent to my original proposal, this can be a
>>>>> lot simpler than what I suggested originally.  Each worker does
>>>>> BarrierAttach() before beginning to read tuples (exiting if the phase
>>>>> returned is non-zero) and BarrierArriveAndDetach() when it's done
>>>>> sorting.  The leader does BarrierAttach() before launching workers and
>>>>> BarrierArriveAndWait() when it's done sorting.
>>>
>>> How does the leader detect if one of the workers does BarrierAttach()
>>> and then fails (either exits or errors out) before doing
>>> BarrierArriveAndDetach()?
>>
>> If you attach and then exit cleanly, that's a programming error and
>> would cause anyone who runs BarrierArriveAndWait() to hang forever.
>>
>
> Right, but what if the worker dies due to something like proc_exit(1)
> before calling BarrierArriveAndWait?  I think this is part of the
> problem we have solved in WaitForParallelWorkersToFinish: if the
> worker exits abruptly at any point for some reason, the system should
> not hang.

Actually what I said before is no longer true: after commit 2badb5af,
if you exit unexpectedly then the new ParallelWorkerShutdown() exit
hook delivers PROCSIG_PARALLEL_MESSAGE (apparently after detaching
from the error queue) and the leader aborts when it tries to read the
error queue.  I just hacked Parallel Hash like this:

                BarrierAttach(build_barrier);
+               if (ParallelWorkerNumber == 0)
+               {
+                       pg_usleep(1000000);
+                       proc_exit(1);
+               }

Now I see:

postgres=# select count(*) from foox r join foox s on r.a = s.a;
ERROR:  lost connection to parallel worker

Using a debugger I can see the leader raising that error with this stack:

HandleParallelMessages at parallel.c:890
ProcessInterrupts at postgres.c:3053
ConditionVariableSleep(cv=0x000000010a62e4c8,
wait_event_info=134217737) at condition_variable.c:151
BarrierArriveAndWait(barrier=0x000000010a62e4b0,
wait_event_info=134217737) at barrier.c:191
MultiExecParallelHash(node=0x00007ffcd9050b10) at nodeHash.c:312
MultiExecHash(node=0x00007ffcd9050b10) at nodeHash.c:112
MultiExecProcNode(node=0x00007ffcd9050b10) at execProcnode.c:502
ExecParallelHashJoin [inlined]
ExecHashJoinImpl(pstate=0x00007ffcda01baa0, parallel='\x01') at
nodeHashjoin.c:291
ExecParallelHashJoin(pstate=0x00007ffcda01baa0) at nodeHashjoin.c:582

-- 
Thomas Munro
http://www.enterprisedb.com


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Tue, Jan 23, 2018 at 9:43 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Right, but what if the worker dies due to something like proc_exit(1)
> before calling BarrierArriveAndWait?  I think this is part of the
> problem we have solved in WaitForParallelWorkersToFinish: if the
> worker exits abruptly at any point for some reason, the system should
> not hang.

I have used Thomas' chaos-monkey-fork-process.patch to verify:

1. The problem of fork failure causing nbtsort.c to wait forever is a
real problem. Sure enough, the coding pattern within
_bt_leader_heapscan() can cause us to wait forever even with commit
2badb5afb89cd569500ef7c3b23c7a9d11718f2f, more or less as a
consequence of the patch not using tuple queues (it uses the new
tuplesort sharing thing instead).

2. Simply adding a single call to WaitForParallelWorkersToFinish()
within _bt_leader_heapscan() before waiting on our condition variable
fixes the problem -- errors are reliably propagated, and we never end
up waiting forever (sketched below).

3. This short-term fix works just as well with
parallel_leader_participation=off.
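
To make point 2 concrete, the fix has roughly this shape (a
hypothetical sketch only -- the struct and field names are invented
for illustration, not taken from the patch):

void
_bt_leader_heapscan(BTLeader *btleader)
{
    /*
     * Propagate errors from workers that died before attaching (e.g.
     * after fork() failure).  Without this, a worker that never
     * starts leaves us waiting on the condition variable forever.
     */
    WaitForParallelWorkersToFinish(btleader->pcxt);

    /* by now every launched worker is done, or we've thrown an error */
    ConditionVariablePrepareToSleep(&btleader->shared->workersdonecv);
    while (btleader->shared->nworkersdone <
           btleader->pcxt->nworkers_launched)
        ConditionVariableSleep(&btleader->shared->workersdonecv,
                               WAIT_EVENT_PARALLEL_FINISH);
    ConditionVariableCancelSleep();
}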

At this point, my preferred solution is for someone to go implement
Amit's WaitForParallelWorkersToAttach() idea [1] (Amit himself seems
like the logical person for the job). Once that's committed, I can
post a new version of the patch that uses that new infrastructure --
I'll add a call to the new function, without changing anything else.
Failing that, we could actually just use
WaitForParallelWorkersToFinish(). I still don't want to use a barrier,
mostly because it complicates  parallel_leader_participation=off,
something that Amit is in agreement with [2][3].

For now, I am waiting for feedback from Robert on next steps.

[1] https://postgr.es/m/CAH2-Wzm6dF=g9LYwthgCqzRc4DzBE-8Tv28Yvg0XJ8Q6e4+cBQ@mail.gmail.com
[2] https://postgr.es/m/CAA4eK1LEFd28p1kw2Fst9LzgBgfMbDEq9wPh9jWFC0ye6ce62A%40mail.gmail.com
[3] https://postgr.es/m/CAA4eK1+a0OF4M231vBgPr_0Ygg_BNmRGZLiB7WQDE-FYBSyrGg@mail.gmail.com
-- 
Peter Geoghegan


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Thomas Munro
Date:
On Thu, Jan 25, 2018 at 8:54 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> I have used Thomas' chaos-monkey-fork-process.patch to verify:
>
> 1. The problem of fork failure causing nbtsort.c to wait forever is a
> real problem. Sure enough, the coding pattern within
> _bt_leader_heapscan() can cause us to wait forever even with commit
> 2badb5afb89cd569500ef7c3b23c7a9d11718f2f, more or less as a
> consequence of the patch not using tuple queues (it uses the new
> tuplesort sharing thing instead).

Just curious: does the attached also help?

> 2. Simply adding a single call to WaitForParallelWorkersToFinish()
> within _bt_leader_heapscan() before waiting on our condition variable
> fixes the problem -- errors are reliably propagated, and we never end
> up waiting forever.

That does seem like a nice, simple solution and I am not against it.
The niggling thing that bothers me about it, though, is that it
requires the client of parallel.c to follow a slightly complicated
protocol or risk a rare obscure failure mode, and recognise the cases
where that's necessary.  Specifically, if you're not blocking in a
shm_mq wait loop, then you must make a call to this new interface
before you do any other kind of latch wait, but if you get that wrong
you'll probably not notice since fork failure is rare!  It seems like
it'd be nicer if we could figure out a way to make it so that any
latch/CFI loop would automatically be safe against fork failure.  The
attached (if it actually works, I dunno) is the worst way, but I
wonder if there is some way to traffic just a teensy bit more
information from postmaster to leader so that it could be efficient...

-- 
Thomas Munro
http://www.enterprisedb.com

Attachment

Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Wed, Jan 24, 2018 at 12:13 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Thu, Jan 25, 2018 at 8:54 AM, Peter Geoghegan <pg@bowt.ie> wrote:
>> I have used Thomas' chaos-monkey-fork-process.patch to verify:
>>
>> 1. The problem of fork failure causing nbtsort.c to wait forever is a
>> real problem. Sure enough, the coding pattern within
>> _bt_leader_heapscan() can cause us to wait forever even with commit
>> 2badb5afb89cd569500ef7c3b23c7a9d11718f2f, more or less as a
>> consequence of the patch not using tuple queues (it uses the new
>> tuplesort sharing thing instead).
>
> Just curious: does the attached also help?

I can still reproduce the problem without the fix I described (which
does work), using your patch instead.

Offhand, I suspect that the way you set ParallelMessagePending may not
always leave it set when it should be.

>> 2. Simply adding a single call to WaitForParallelWorkersToFinish()
>> within _bt_leader_heapscan() before waiting on our condition variable
>> fixes the problem -- errors are reliably propagated, and we never end
>> up waiting forever.
>
> That does seem like a nice, simple solution and I am not against it.
> The niggling thing that bothers me about it, though, is that it
> requires the client of parallel.c to follow a slightly complicated
> protocol or risk a rare obscure failure mode, and recognise the cases
> where that's necessary.  Specifically, if you're not blocking in a
> shm_mq wait loop, then you must make a call to this new interface
> before you do any other kind of latch wait, but if you get that wrong
> you'll probably not notice since fork failure is rare!  It seems like
> it'd be nicer if we could figure out a way to make it so that any
> latch/CFI loop would automatically be safe against fork failure.

It would certainly be nicer, but I don't see much risk if we add a
comment next to nworkers_launched that said: Don't trust this until
you've called (Amit's proposed) WaitForParallelWorkersToAttach()
function, unless you're using the tuple queue infrastructure, which
spares you from having to directly care about the distinction between
a launched worker never starting and a launched worker successfully
completing.
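
Something like this, say (wording is only illustrative):

/*
 * Caution: nworkers_launched is the number of workers that the
 * postmaster was asked to start, not the number that actually
 * attached.  If fork() fails, a "launched" worker may never show up.
 * Don't rely on this value until WaitForParallelWorkersToAttach() has
 * returned -- unless all of your waiting goes through the tuple queue
 * (shm_mq) machinery, which detects failed workers for you.
 */
int         nworkers_launched;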

While I agree with what Robert said on the other thread -- "I guess
that works, but it seems more like blind luck than good design.
Parallel CREATE INDEX fails to be as "lucky" as Gather" -- that
doesn't mean the situation cannot be formalized. And even if it
isn't formalized, then I think that will probably be because
Gather ends up doing almost the same thing.

-- 
Peter Geoghegan


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Thomas Munro
Date:
On Thu, Jan 25, 2018 at 9:28 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Wed, Jan 24, 2018 at 12:13 PM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> On Thu, Jan 25, 2018 at 8:54 AM, Peter Geoghegan <pg@bowt.ie> wrote:
>>> I have used Thomas' chaos-monkey-fork-process.patch to verify:
>>>
>>> 1. The problem of fork failure causing nbtsort.c to wait forever is a
>>> real problem. Sure enough, the coding pattern within
>>> _bt_leader_heapscan() can cause us to wait forever even with commit
>>> 2badb5afb89cd569500ef7c3b23c7a9d11718f2f, more or less as a
>>> consequence of the patch not using tuple queues (it uses the new
>>> tuplesort sharing thing instead).
>>
>> Just curious: does the attached also help?
>
> I can still reproduce the problem without the fix I described (which
> does work), using your patch instead.
>
> Offhand, I suspect that the way you set ParallelMessagePending may not
> always leave it set when it should be.

Here's a version that works, and a minimal repro test module thing.
Without 0003 applied, it hangs.  With 0003 applied, it does this:

postgres=# call test_fork_failure();
CALL
postgres=# call test_fork_failure();
CALL
postgres=# call test_fork_failure();
ERROR:  lost connection to parallel worker
postgres=# call test_fork_failure();
ERROR:  lost connection to parallel worker

I won't be surprised if 0003 is judged to be a horrendous abuse of the
interrupt system, but these patches might at least be useful for
understanding the problem.

-- 
Thomas Munro
http://www.enterprisedb.com

Attachment

Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Wed, Jan 24, 2018 at 5:31 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> Here's a version that works, and a minimal repro test module thing.
> Without 0003 applied, it hangs.

I can confirm that this version does in fact fix the problem with
parallel CREATE INDEX hanging in the event of (simulated) worker
fork() failure. And, it seems to have at least one tiny advantage over
the other approaches I was talking about that you didn't mention,
which is that we never have to wait until the leader stops
participating as a worker before an error is raised. IOW, either the
whole parallel CREATE INDEX operation throws an error at an early
point, or it completely succeeds.

Obviously, the other, previously stated advantage is more relevant:
this way, *everyone* automatically avoids having to worry about
nworkers_launched being inaccurate, including code that gets away with
it today only because it uses a tuple queue (such as nodeGather.c) but
may not always get away with it in the future.

I've run out of time to assess what you've done here in any real
depth. For now, I will say that this approach seems interesting to me.
I'll take a closer look tomorrow.

-- 
Peter Geoghegan


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Amit Kapila
Date:
On Thu, Jan 25, 2018 at 1:24 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Tue, Jan 23, 2018 at 9:43 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Right, but what if the worker dies due to something like proc_exit(1)
>> before calling BarrierArriveAndWait?  I think this is part of the
>> problem we have solved in WaitForParallelWorkersToFinish: if the
>> worker exits abruptly at any point for some reason, the system should
>> not hang.
>
> I have used Thomas' chaos-monkey-fork-process.patch to verify:
>
> 1. The problem of fork failure causing nbtsort.c to wait forever is a
> real problem. Sure enough, the coding pattern within
> _bt_leader_heapscan() can cause us to wait forever even with commit
> 2badb5afb89cd569500ef7c3b23c7a9d11718f2f, more or less as a
> consequence of the patch not using tuple queues (it uses the new
> tuplesort sharing thing instead).
>
> 2. Simply adding a single call to WaitForParallelWorkersToFinish()
> within _bt_leader_heapscan() before waiting on our condition variable
> fixes the problem -- errors are reliably propagated, and we never end
> up waiting forever.
>
> 3. This short-term fix works just as well with
> parallel_leader_participation=off.
>
> At this point, my preferred solution is for someone to go implement
> Amit's WaitForParallelWorkersToAttach() idea [1] (Amit himself seems
> like the logical person for the job).
>

I can implement it and share a prototype patch with you which you can
use to test parallel sort stuff.  I would like to highlight the
difference which you will see with WaitForParallelWorkersToAttach as
compared to WaitForParallelWorkersToFinish(): the former will
give you how many of nworkers_launched workers are actually launched,
whereas the latter gives an error if any of the expected workers is not
launched.  I feel the former is good, and your proposed way of calling it
after the leader is done with its work has alleviated the minor
disadvantage of this API which is that we need for workers to startup.

However, now I see that you and Thomas are trying to find a different
way to overcome this problem, so I am not sure if I should go
ahead or not.  I have seen that you said you wanted to look at
Thomas's proposed stuff carefully tomorrow, so I will wait for you
guys to decide which way is appropriate.

> Once that's committed, I can
> post a new version of the patch that uses that new infrastructure --
> I'll add a call to the new function, without changing anything else.
> Failing that, we could actually just use
> WaitForParallelWorkersToFinish(). I still don't want to use a barrier,
> mostly because it complicates  parallel_leader_participation=off,
> something that Amit is in agreement with [2][3].
>

I think if we want we can use the barrier APIs to solve this problem, but
I kind of have a feeling that it is not the most
appropriate API, especially because an existing API like
WaitForParallelWorkersToFinish() can serve the need in a similar way.

Just to conclude, following are the proposed ways to solve this problem:

1. Implement a new API WaitForParallelWorkersToAttach and use that to
solve this problem.  Peter G. and Amit think this is a good way to
solve this problem.
2. Use the existing API WaitForParallelWorkersToFinish to solve this
problem.  Peter G. feels that if the API mentioned in #1 is not available,
we can use this to solve the problem, and I agree with that position.
Thomas is not against it.
3. Use Thomas's new way to detect such failures.  It is not clear to
me at this stage if any one of us has accepted it as the way to
proceed, but Thomas and Peter G. want to investigate it further.
4. Use the Barrier API to solve this problem.  Robert appears to be
strongly in favor of this approach.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Amit Kapila
Date:
On Fri, Jan 26, 2018 at 11:30 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Jan 25, 2018 at 1:24 AM, Peter Geoghegan <pg@bowt.ie> wrote:
>> On Tue, Jan 23, 2018 at 9:43 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> Right, but what if the worker dies due to something like proc_exit(1)
>>> before calling BarrierArriveAndWait?  I think this is part of the
>>> problem we have solved in WaitForParallelWorkersToFinish: if the
>>> worker exits abruptly at any point for some reason, the system should
>>> not hang.
>>
>> I have used Thomas' chaos-monkey-fork-process.patch to verify:
>>
>> 1. The problem of fork failure causing nbtsort.c to wait forever is a
>> real problem. Sure enough, the coding pattern within
>> _bt_leader_heapscan() can cause us to wait forever even with commit
>> 2badb5afb89cd569500ef7c3b23c7a9d11718f2f, more or less as a
>> consequence of the patch not using tuple queues (it uses the new
>> tuplesort sharing thing instead).
>>
>> 2. Simply adding a single call to WaitForParallelWorkersToFinish()
>> within _bt_leader_heapscan() before waiting on our condition variable
>> fixes the problem -- errors are reliably propagated, and we never end
>> up waiting forever.
>>
>> 3. This short-term fix works just as well with
>> parallel_leader_participation=off.
>>
>> At this point, my preferred solution is for someone to go implement
>> Amit's WaitForParallelWorkersToAttach() idea [1] (Amit himself seems
>> like the logical person for the job).
>>
>
> I can implement it and share a prototype patch with you which you can
> use to test parallel sort stuff.  I would like to highlight the
> difference which you will see with WaitForParallelWorkersToAttach as
> compared to WaitForParallelWorkersToFinish(): the former will
> give you how many of nworkers_launched workers are actually launched,
> whereas the latter gives an error if any of the expected workers is not
> launched.  I feel the former is good, and your proposed way of calling it
> after the leader is done with its work has alleviated the minor
> disadvantage of this API which is that we need for workers to startup.
>

/we need for workers to startup./we need to wait for workers to startup.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Thu, Jan 25, 2018 at 10:00 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> At this point, my preferred solution is for someone to go implement
>> Amit's WaitForParallelWorkersToAttach() idea [1] (Amit himself seems
>> like the logical person for the job).
>>
>
> I can implement it and share a prototype patch with you which you can
> use to test parallel sort stuff.

That would be great. Thank you.

> I would like to highlight the
> difference which you will see with WaitForParallelWorkersToAttach as
> compared to WaitForParallelWorkersToFinish(): the former will
> give you how many of nworkers_launched workers are actually launched,
> whereas the latter gives an error if any of the expected workers is not
> launched.  I feel the former is good, and your proposed way of calling it
> after the leader is done with its work has alleviated the minor
> disadvantage of this API which is that we need for workers to startup.

I'm not sure that it makes much difference, though, since in the end
WaitForParallelWorkersToFinish() is called anyway, much like
nodeGather.c. Have I missed something?

I had imagined that WaitForParallelWorkersToAttach() would give me an
error in the style of WaitForParallelWorkersToFinish(), without
actually waiting for the parallel workers to finish.
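
IOW, these are the two possible shapes of the API as I understand them
(hypothetical sketches, either way):

/* Amit's description: report how many workers actually attached */
nattached = WaitForParallelWorkersToAttach(pcxt);

/*
 * What I had imagined: return once every launched worker has either
 * attached or demonstrably failed, raising an error in the latter
 * case -- but never waiting for workers to finish their work.
 */
WaitForParallelWorkersToAttach(pcxt);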

> However, now I see that you and Thomas are trying to find a different
> way to overcome this problem, so I am not sure if I should go
> ahead or not.  I have seen that you said you wanted to look at
> Thomas's proposed stuff carefully tomorrow, so I will wait for you
> guys to decide which way is appropriate.

I suspect that the overhead of Thomas' experimental approach is going
to cause problems in certain cases. Cases that are hard to foresee.
That patch makes HandleParallelMessages() set ParallelMessagePending
artificially, pending confirmation of having launched all workers.

It was an interesting experiment, but I think that your
WaitForParallelWorkersToAttach() idea has a better chance of working
out.

>> Once that's committed, I can
>> post a new version of the patch that uses that new infrastructure --
>> I'll add a call to the new function, without changing anything else.
>> Failing that, we could actually just use
>> WaitForParallelWorkersToFinish(). I still don't want to use a barrier,
>> mostly because it complicates  parallel_leader_participation=off,
>> something that Amit is in agreement with [2][3].
>>
>
> I think if we want we can use the barrier APIs to solve this problem, but
> I kind of have a feeling that it is not the most
> appropriate API, especially because an existing API like
> WaitForParallelWorkersToFinish() can serve the need in a similar way.

I can't see a way in which using a barrier can have less complexity. I
think it will have quite a bit more, and I suspect that you share this
feeling.

> Just to conclude, following are the proposed ways to solve this problem:
>
> 1. Implement a new API WaitForParallelWorkersToAttach and use that to
> solve this problem.  Peter G. and Amit think this is a good way to
> solve this problem.
> 2. Use the existing API WaitForParallelWorkersToFinish to solve this
> problem.  Peter G. feels that if the API mentioned in #1 is not available,
> we can use this to solve the problem, and I agree with that position.
> Thomas is not against it.
> 3. Use Thomas's new way to detect such failures.  It is not clear to
> me at this stage if any one of us has accepted it as the way to
> proceed, but Thomas and Peter G. want to investigate it further.
> 4. Use the Barrier API to solve this problem.  Robert appears to be
> strongly in favor of this approach.

That's a good summary.

The next revision of the patch will make the
leader-participates-as-worker spool/Tuplesortstate start and finish
sorting before the main leader spool/Tuplesortstate is even started.
I did this with the intention of making it very clear that my approach
does not assume a number of participants up-front -- that is actually
something we only need a final answer on at the point that the leader
merges, which is logically the last possible moment.

Hopefully this will reassure Robert. It is quite a small change, but
leads to a slightly cleaner organization within nbtsort.c, since
_bt_begin_parallel() is the only point that has to deal with leader
participation. Another minor advantage is that this makes the
trace_sort overheads/durations for each of the two tuplesorts within the
leader non-overlapping (when the leader participates as a worker).

-- 
Peter Geoghegan


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Amit Kapila
Date:
On Fri, Jan 26, 2018 at 12:00 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Thu, Jan 25, 2018 at 10:00 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> At this point, my preferred solution is for someone to go implement
>>> Amit's WaitForParallelWorkersToAttach() idea [1] (Amit himself seems
>>> like the logical person for the job).
>>>
>>
>> I can implement it and share a prototype patch with you which you can
>> use to test parallel sort stuff.
>
> That would be great. Thank you.
>
>> I would like to highlight the
>> difference which you will see with WaitForParallelWorkersToAttach as
>> compared to WaitForParallelWorkersToFinish(): the former will
>> give you how many of nworkers_launched workers are actually launched,
>> whereas the latter gives an error if any of the expected workers is not
>> launched.  I feel the former is good, and your proposed way of calling it
>> after the leader is done with its work has alleviated the minor
>> disadvantage of this API which is that we need for workers to startup.
>
> I'm not sure that it makes much difference, though, since in the end
> WaitForParallelWorkersToFinish() is called anyway, much like
> nodeGather.c. Have I missed something?
>

Nope, you are right.  I had in my mind that if we have something like
what I am proposing, then we don't even need to detect failures in
WaitForParallelWorkersToFinish and we can finish the work without
failing.

> I had imagined that WaitForParallelWorkersToAttach() would give me an
> error in the style of WaitForParallelWorkersToFinish(), without
> actually waiting for the parallel workers to finish.
>

I think that is also doable.  I will give it a try and report back if
I see any problem with it.  However, it might take me some time, as I
am busy with a few other things and I am planning to take two days off
for personal reasons; OTOH, if it turns out to be simple (which
I expect it should be), then I will report back early.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Robert Haas
Date:
On Fri, Jan 26, 2018 at 1:30 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> I had imagined that WaitForParallelWorkersToAttach() would give me an
> error in the style of WaitForParallelWorkersToFinish(), without
> actually waiting for the parallel workers to finish.

+1.  If we're going to go that route, and that seems to be the
consensus, then I think an error is more appropriate than returning an
updated worker count.

On the question of whether this is better or worse than using
barriers, I'm not entirely sure.  I understand that various objections
to the Barrier concept have been raised, but I'm not personally
convinced by any of them.  On the other hand, if we only have to call
WaitForParallelWorkersToAttach after the leader finishes its own sort,
then there's no latency advantage to the barrier approach.  I suspect
we might still end up reworking this if we add the ability for new
workers to join an index build in medias res at some point in the
future -- but, as Peter points out, maybe the whole algorithm would
get reworked in that scenario.  So, since other people like
WaitForParallelWorkersToAttach, I think we can just go with that for
now.  I don't want to kill this patch with unnecessary nitpicking.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Fri, Jan 26, 2018 at 10:01 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Jan 26, 2018 at 1:30 AM, Peter Geoghegan <pg@bowt.ie> wrote:
>> I had imagined that WaitForParallelWorkersToAttach() would give me an
>> error in the style of WaitForParallelWorkersToFinish(), without
>> actually waiting for the parallel workers to finish.
>
> +1.  If we're going to go that route, and that seems to be the
> consensus, then I think an error is more appropriate than returning an
> updated worker count.

Great.

Should I wait for Amit's WaitForParallelWorkersToAttach() patch to be
posted, reviewed, and committed, or would you like to see what I came
up with ("The next revision of the patch will make the
leader-participates-as-worker spool/Tuplesortstate start and finish
sorting before the main leader spool/Tuplesortstate is even started")
today?

-- 
Peter Geoghegan


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Robert Haas
Date:
On Fri, Jan 26, 2018 at 1:17 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Fri, Jan 26, 2018 at 10:01 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Fri, Jan 26, 2018 at 1:30 AM, Peter Geoghegan <pg@bowt.ie> wrote:
>>> I had imagined that WaitForParallelWorkersToAttach() would give me an
>>> error in the style of WaitForParallelWorkersToFinish(), without
>>> actually waiting for the parallel workers to finish.
>>
>> +1.  If we're going to go that route, and that seems to be the
>> consensus, then I think an error is more appropriate than returning an
>> updated worker count.
>
> Great.
>
> Should I wait for Amit's WaitForParallelWorkersToAttach() patch to be
> posted, reviewed, and committed, or would you like to see what I came
> up with ("The next revision of the patch will make the
> leader-participates-as-worker spool/Tuplesortstate start and finish
> sorting before the main leader spool/Tuplesortstate is even started")
> today?

I'm busy with other things, so no rush.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Fri, Jan 26, 2018 at 10:33 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> I'm busy with other things, so no rush.

Got it.

There is one question that I should probably get clarity on ahead of
the next revision, which is: Should I rip out the code that disallows
a "degenerate parallel CREATE INDEX" when
parallel_leader_participation=off, or should I instead rip out any
code that deals with parallel_leader_participation, and always have
the leader participate as a worker?

If I did the latter, then leader non-participation would live on as a
#define debug option within nbtsort.c. It definitely seems like we'd
want to preserve that at a minimum.

-- 
Peter Geoghegan


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Robert Haas
Date:
On Fri, Jan 26, 2018 at 2:04 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Fri, Jan 26, 2018 at 10:33 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I'm busy with other things, so no rush.
>
> Got it.
>
> There is one question that I should probably get clarity on ahead of
> the next revision, which is: Should I rip out the code that disallows
> a "degenerate parallel CREATE INDEX" when
> parallel_leader_participation=off, or should I instead rip out any
> code that deals with parallel_leader_participation, and always have
> the leader participate as a worker?
>
> If I did the latter, then leader non-participation would live on as a
> #define debug option within nbtsort.c. It definitely seems like we'd
> want to preserve that at a minimum.

Hmm, I like the idea of making it a #define instead of having it
depend on parallel_leader_participation.  Let's do that.  If the
consensus is later that it was the wrong decision, it'll be easy to
change it back.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Fri, Jan 26, 2018 at 11:17 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> Hmm, I like the idea of making it a #define instead of having it
> depend on parallel_leader_participation.  Let's do that.  If the
> consensus is later that it was the wrong decision, it'll be easy to
> change it back.

WFM.

-- 
Peter Geoghegan


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Thomas Munro
Date:
On Fri, Jan 26, 2018 at 7:30 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Thu, Jan 25, 2018 at 10:00 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> However, now I see that you and Thomas are trying to find a different
>> way to overcome this problem, so I am not sure if I should go
>> ahead or not.  I have seen that you said you wanted to look at
>> Thomas's proposed stuff carefully tomorrow, so I will wait for you
>> guys to decide which way is appropriate.
>
> I suspect that the overhead of Thomas' experimental approach is going
> to cause problems in certain cases. Cases that are hard to foresee.
> That patch makes HandleParallelMessages() set ParallelMessagePending
> artificially, pending confirmation of having launched all workers.
>
> It was an interesting experiment, but I think that your
> WaitForParallelWorkersToAttach() idea has a better chance of working
> out.

Thanks for looking into this.  Yeah.  I think you're right that it
could add a bit of overhead in some cases (ie if you receive a lot of
signals that AREN'T caused by fork failure, then you'll enter
HandleParallelMessages() every time unnecessarily), and it does feel a
bit kludgy.  The best idea I have to fix that so far is like this: (1)
add a member fork_failure_count to struct BackgroundWorkerArray, (2)
in do_start_bgworker() whenever fork fails, do
++BackgroundWorkerData->fork_failure_count (ie before a signal is sent
to the leader), (3) in procsignal_sigusr1_handler where we normally do
a bunch of CheckProcSignal(PROCSIG_XXX) stuff, if
(BackgroundWorkerData->fork_failure_count !=
last_observed_fork_failure_count) HandleParallelMessageInterrupt().
As far as I know, as long as fork_failure_count is (say) int32 (ie not
prone to tearing) then no locking is required due to the barriers
implicit in the syscalls involved there.  This is still slightly more
pessimistic than it needs to be (the failed fork may be for someone
else's ParallelContext), but only in rare cases so it would be
practically as good as precise PROCSIG delivery.  It's just that we
aren't allowed to deliver PROCSIGs from the postmaster.  We are
allowed to communicate through BackgroundWorkerData, and there is a
precedent for cluster-visible event counters in there already.
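
In fragment form, that's roughly the following (an untested sketch;
the two parallel_* counters are the existing precedent, everything
else is made up):

/* (1) in bgworker.c's shared struct */
typedef struct BackgroundWorkerArray
{
    int         total_slots;
    uint32      parallel_register_count;    /* existing counters */
    uint32      parallel_terminate_count;
    uint32      fork_failure_count;         /* new: bumped by postmaster */
    BackgroundWorkerSlot slot[FLEXIBLE_ARRAY_MEMBER];
} BackgroundWorkerArray;

/* (2) postmaster, in do_start_bgworker(), when fork() fails */
++BackgroundWorkerData->fork_failure_count; /* before signaling leader */

/* (3) called from procsignal_sigusr1_handler() in every backend */
static uint32 last_observed_fork_failure_count = 0;

static void
CheckForForkFailure(void)
{
    if (BackgroundWorkerData->fork_failure_count !=
        last_observed_fork_failure_count)
    {
        last_observed_fork_failure_count =
            BackgroundWorkerData->fork_failure_count;
        HandleParallelMessageInterrupt();   /* sets ParallelMessagePending */
    }
}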

I think you should proceed with Amit's plan.  If we ever make a plan
like the above work in future, it'd render that redundant by turning
every CFI() into a cancellation point for fork failure, but I'm not
planning to investigate further given the muted response to my
scheming in this area so far.

-- 
Thomas Munro
http://www.enterprisedb.com


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Robert Haas
Date:
On Fri, Jan 26, 2018 at 6:40 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> Thanks for looking into this.  Yeah.  I think you're right that it
> could add a bit of overhead in some cases (ie if you receive a lot of
> signals that AREN'T caused by fork failure, then you'll enter
> HandleParallelMessages() every time unnecessarily), and it does feel a
> bit kludgy.  The best idea I have to fix that so far is like this: (1)
> add a member fork_failure_count to struct BackgroundWorkerArray, (2)
> in do_start_bgworker() whenever fork fails, do
> ++BackgroundWorkerData->fork_failure_count (ie before a signal is sent
> to the leader), (3) in procsignal_sigusr1_handler where we normally do
> a bunch of CheckProcSignal(PROCSIG_XXX) stuff, if
> (BackgroundWorkerData->fork_failure_count !=
> last_observed_fork_failure_count) HandleParallelMessageInterrupt().
> As far as I know, as long as fork_failure_count is (say) int32 (ie not
> prone to tearing) then no locking is required due to the barriers
> implicit in the syscalls involved there.  This is still slightly more
> pessimistic than it needs to be (the failed fork may be for someone
> else's ParallelContext), but only in rare cases so it would be
> practically as good as precise PROCSIG delivery.  It's just that we
> aren't allowed to deliver PROCSIGs from the postmaster.  We are
> allowed to communicate through BackgroundWorkerData, and there is a
> precedent for cluster-visible event counters in there already.

I could sign on to that plan, but I don't think we should hold this
patch up for it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Amit Kapila
Date:
On Fri, Jan 26, 2018 at 12:36 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Fri, Jan 26, 2018 at 12:00 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>> On Thu, Jan 25, 2018 at 10:00 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>>> At this point, my preferred solution is for someone to go implement
>>>> Amit's WaitForParallelWorkersToAttach() idea [1] (Amit himself seems
>>>> like the logical person for the job).
>>>>
>>>
>>> I can implement it and share a prototype patch with you which you can
>>> use to test parallel sort stuff.
>>
>> That would be great. Thank you.
>>
>>> I would like to highlight the
>>> difference which you will see with WaitForParallelWorkersToAttach as
>>> compared to WaitForParallelWorkersToFinish(): the former will
>>> give you how many of nworkers_launched workers are actually launched,
>>> whereas the latter gives an error if any of the expected workers is not
>>> launched.  I feel the former is good, and your proposed way of calling it
>>> after the leader is done with its work has alleviated the minor
>>> disadvantage of this API which is that we need for workers to startup.
>>
>> I'm not sure that it makes much difference, though, since in the end
>> WaitForParallelWorkersToFinish() is called anyway, much like
>> nodeGather.c. Have I missed something?
>>
>
> Nope, you are right.  I had in my mind that if we have something like
> what I am proposing, then we don't even need to detect failures in
> WaitForParallelWorkersToFinish and we can finish the work without
> failing.
>
>> I had imagined that WaitForParallelWorkersToAttach() would give me an
>> error in the style of WaitForParallelWorkersToFinish(), without
>> actually waiting for the parallel workers to finish.
>>
>
> I think that is also doable.  I will give it a try and report back if
> I see any problem with it.
>

I have written the patch for the above API and posted it on a new
thread [1].  Do let me know, either here or on that thread, whether
the patch suffices your need.

[1] -
https://www.postgresql.org/message-id/CAA4eK1%2Be2MzyouF5bg%3DOtyhDSX%2B%3DAo%3D3htN%3DT-r_6s3gCtKFiw%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Sat, Jan 27, 2018 at 12:20 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> I have written the patch for the above API and posted it on a new
> thread [1].  Do let me know, either here or on that thread, whether
> the patch suffices your need.

I've responded to you over on that thread. Thanks again for helping me.

I already have a revision of my patch lined up that is coded to target
your new WaitForParallelWorkersToAttach() interface, plus some other
changes. These include:

* Make the leader's worker Tuplesortstate complete before the main
leader Tuplesortstate even begins, making it very clear that nbtsort.c
does not rely on knowing the number of launched workers up-front. That
should make Robert a bit happier about our ability to add additional
workers fairly late in the process, in a future tuplesort client that
finds that to be useful.

* I've added a new "parallel" argument to index_build(), which
controls whether or not we even call the plan_create_index_workers()
cost model. When this is false, we always do a serial build (see the
sketch after this list). This was added because I noticed that
TRUNCATE REINDEXes the table at a time when parallelism couldn't
possibly be useful, yet it still used parallelism. Best to have the
top-level caller opt in or opt out.

* Polished the docs some more.

* Improved commentary on randomAccess/writable leader handling within
logtape.c. We could still support that, if we were willing to make
shared BufFiles that are opened within another backend writable. I'm
not proposing to do that, but it's nice that we could.
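
To be concrete about the index_build() item, the new signature looks
roughly like this (a sketch; the exact parameter list in the revision
may differ):

extern void index_build(Relation heapRelation,
                        Relation indexRelation,
                        IndexInfo *indexInfo,
                        bool isprimary,
                        bool isreindex,
                        bool parallel); /* false: never consider workers */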

I hesitate to post something that won't cleanly apply on the master
branch's tip, but otherwise I am ready to send this new revision of
the patch right away. It seems likely that Robert will commit your
patch within a matter of days, once some minor issues are worked
through, at which point I'll send what I have. If anyone prefers, I
can post the patch immediately, and break out the
WaitForParallelWorkersToAttach() as the second patch in a cumulative
patch set. Right now, I'm out of things to work on here.

Notes on how I've stress-tested parallel CREATE INDEX:

I can recommend using the amcheck heapallindexed functionality [1]
from the Github version of amcheck to test this patch. You will need
to modify the call to IndexBuildHeapScan() that the extension makes,
to add a new NULL "scan" argument, since parallel CREATE INDEX changes
the signature of IndexBuildHeapScan(). That's trivial, though.
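
For example, the call would change along these lines (a sketch from
memory of the Github version; argument lists abbreviated):

-   IndexBuildHeapScan(heaprel, indexrel, indexinfo, true,
-                      bt_tuple_present_callback, (void *) state);
+   IndexBuildHeapScan(heaprel, indexrel, indexinfo, true,
+                      bt_tuple_present_callback, (void *) state,
+                      NULL);   /* new "scan" argument */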

Note that parallel CREATE INDEX should produce relfiles that are
physically identical to those of a serial CREATE INDEX, since index
tuplesorts are deterministic. IOW, we use a heap TID tie-breaker
within tuplesort.c for B-Tree index tuples, which assures us that
varying maintenance_work_mem won't affect the final output even in a
tiny, insignificant way -- using parallelism should not change
anything about the exact output, either. At one point I was testing
this patch by verifying not only that indexes were sane, but that they
were physically identical to what a serial sort (in the master branch)
produced (I only needed to mask page LSNs).

Finally, yet another good way to test this patch is to verify that
everything continues to work when MAX_PHYSICAL_FILESIZE is modified to
be BLCKSZ (2^13 rather than 2^30). You will get many, many BufFile
segments that way, which could in theory reveal bugs in rare edge
cases that I haven't considered. This strategy led to my
finding a bug in v10 at one point [2], as well as bugs in earlier
versions of Thomas' parallel hash join patch set. It has worked for me
twice already, so it seems like a good one. It may be worth
*combining* with some other stress-testing strategy.
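
Concretely, the MAX_PHYSICAL_FILESIZE change is a one-line tweak in
buffile.c (for stress testing only, obviously):

-#define MAX_PHYSICAL_FILESIZE   0x40000000
+#define MAX_PHYSICAL_FILESIZE   BLCKSZ  /* one block per segment */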

[1] https://github.com/petergeoghegan/amcheck#optional-heapallindexed-verification
[2] https://www.postgresql.org/message-id/CAM3SWZRWdNtkhiG0GyiX_1mUAypiK3dV6-6542pYe2iEL-foTA@mail.gmail.com
-- 
Peter Geoghegan


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Mon, Jan 29, 2018 at 4:06 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Sat, Jan 27, 2018 at 12:20 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> I have written the patch for the above API and posted it on a new
>> thread [1].  Do let me know, either here or on that thread, whether
>> the patch suffices your need.
>
> I've responded to you over on that thread. Thanks again for helping me.
>
> I already have a revision of my patch lined up that is coded to target
> your new WaitForParallelWorkersToAttach() interface, plus some other
> changes.

Attached patch has these changes.

-- 
Peter Geoghegan

Attachment

Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Robert Haas
Date:
On Fri, Feb 2, 2018 at 11:16 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> Attached patch has these changes.

And that patch you attached is also, now, committed.

If you could keep an eye on the buildfarm and investigate anything
that breaks, I would appreciate it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Fri, Feb 2, 2018 at 10:37 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> And that patch you attached is also, now, committed.
>
> If you could keep an eye on the buildfarm and investigate anything
> that breaks, I would appreciate it.

Fantastic!

I can keep an eye on it throughout the day.

Thanks everyone
-- 
Peter Geoghegan


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Fri, Feb 2, 2018 at 10:38 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> Thanks everyone

I would like to acknowledge the assistance of Corey Huinker with early
testing of the patch (this took place in 2016, and much of it was not
on-list). Even though he wasn't credited in the commit message, he
should appear in the V11 release notes reviewer list IMV. His
contribution certainly merits it.

-- 
Peter Geoghegan


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Robert Haas
Date:
On Fri, Feb 2, 2018 at 3:23 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Fri, Feb 2, 2018 at 10:38 AM, Peter Geoghegan <pg@bowt.ie> wrote:
>> Thanks everyone
>
> I would like to acknowledge the assistance of Corey Huinker with early
> testing of the patch (this took place in 2016, and much of it was not
> on-list). Even though he wasn't credited in the commit message, he
> should appear in the V11 release notes reviewer list IMV. His
> contribution certainly merits it.

For the record, I typically construct the list of reviewers by reading
over the thread and adding all the people whose names I find there in
chronological order, excluding things that are clearly not review
(like "Bumped to next CF.") and opinions on narrow questions that
don't indicate that any code-reading or testing was done (like "+1 for
calling the GUC foo_bar_baz rather than quux_bletch".)  I saw that you
copied Corey on the original email, but I see no posts from him on the
thread, which is why he didn't get included in the commit message.
While I have no problem with him being included in the release notes,
I obviously can't know about activity that happens entirely off-list.
If you mentioned somewhere in the 200+ messages on this topic that he
should be included, I missed that, too.  I think it's much harder to
give credit adequately when contributions are off-list; letting
everyone know what's going on is why we have a list.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Fri, Feb 2, 2018 at 12:30 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> For the record, I typically construct the list of reviewers by reading
> over the thread and adding all the people whose names I find there in
> chronological order, excluding things that are clearly not review
> (like "Bumped to next CF.") and opinions on narrow questions that
> don't indicate that any code-reading or testing was done (like "+1 for
> calling the GUC foo_bar_baz rather than quux_bletch".)  I saw that you
> copied Corey on the original email, but I see no posts from him on the
> thread, which is why he didn't get included in the commit message.

I did credit him in my own proposed commit message. I know that it's
not part of your workflow to preserve that, but I had assumed that
that would at least be taken into account.

Anyway, mistakes like this happen. I'm glad that we now have the
reviewer credit list, so that they can be corrected afterwards.

-- 
Peter Geoghegan


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Robert Haas
Date:
On Fri, Feb 2, 2018 at 3:35 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Fri, Feb 2, 2018 at 12:30 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> For the record, I typically construct the list of reviewers by reading
>> over the thread and adding all the people whose names I find there in
>> chronological order, excluding things that are clearly not review
>> (like "Bumped to next CF.") and opinions on narrow questions that
>> don't indicate that any code-reading or testing was done (like "+1 for
>> calling the GUC foo_bar_baz rather than quux_bletch".)  I saw that you
>> copied Corey on the original email, but I see no posts from him on the
>> thread, which is why he didn't get included in the commit message.
>
> I did credit him in my own proposed commit message. I know that it's
> not part of your workflow to preserve that, but I had assumed that
> that would at least be taken into account.

Ah.  Sorry, I didn't look at that.  I try to remember to look at
proposed commit messages, but not everyone includes them, which is
probably part of the reason I don't always remember to look for them.
Or maybe I've just failed to adequately develop that habit...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Fri, Feb 2, 2018 at 10:38 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Fri, Feb 2, 2018 at 10:37 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> If you could keep an eye on the buildfarm and investigate anything
>> that breaks, I would appreciate it.

> I can keep an eye on it throughout the day.

There is a benign Valgrind error that causes the lousyjack animal to
report failure. It looks like this:

==6850== Syscall param write(buf) points to uninitialised byte(s)
==6850==    at 0x4E4D534: write (in /usr/lib64/libpthread-2.26.so)
==6850==    by 0x82328F: FileWrite (fd.c:2017)
==6850==    by 0x8261AD: BufFileDumpBuffer (buffile.c:513)
==6850==    by 0x826569: BufFileFlush (buffile.c:657)
==6850==    by 0x8262FB: BufFileRead (buffile.c:561)
==6850==    by 0x9F6C79: ltsReadBlock (logtape.c:273)
==6850==    by 0x9F7ACF: LogicalTapeFreeze (logtape.c:906)
==6850==    by 0xA05B0D: worker_freeze_result_tape (tuplesort.c:4477)
==6850==    by 0xA05BC6: worker_nomergeruns (tuplesort.c:4499)
==6850==    by 0x9FCA1E: tuplesort_performsort (tuplesort.c:1823)

I'll need to go and write a Valgrind suppression for this. I'll get to
it later today.

-- 
Peter Geoghegan


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Andres Freund
Date:
On 2018-02-02 13:35:59 -0800, Peter Geoghegan wrote:
> On Fri, Feb 2, 2018 at 10:38 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> > On Fri, Feb 2, 2018 at 10:37 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> >> If you could keep an eye on the buildfarm and investigate anything
> >> that breaks, I would appreciate it.
> 
> > I can keep an eye on it throughout the day.
> 
> There is a benign Valgrind error that causes the lousyjack animal to
> report failure. It looks like this:
> 
> ==6850== Syscall param write(buf) points to uninitialised byte(s)
> ==6850==    at 0x4E4D534: write (in /usr/lib64/libpthread-2.26.so)
> ==6850==    by 0x82328F: FileWrite (fd.c:2017)
> ==6850==    by 0x8261AD: BufFileDumpBuffer (buffile.c:513)
> ==6850==    by 0x826569: BufFileFlush (buffile.c:657)
> ==6850==    by 0x8262FB: BufFileRead (buffile.c:561)
> ==6850==    by 0x9F6C79: ltsReadBlock (logtape.c:273)
> ==6850==    by 0x9F7ACF: LogicalTapeFreeze (logtape.c:906)
> ==6850==    by 0xA05B0D: worker_freeze_result_tape (tuplesort.c:4477)
> ==6850==    by 0xA05BC6: worker_nomergeruns (tuplesort.c:4499)
> ==6850==    by 0x9FCA1E: tuplesort_performsort (tuplesort.c:1823)

Not saying you're wrong, but you should include a comment on why this is
a benign warning. Presumably it's some padding memory somewhere, but
it's not obvious from the above bleat.

Greetings,

Andres Freund


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Fri, Feb 2, 2018 at 1:58 PM, Andres Freund <andres@anarazel.de> wrote:
> Not saying you're wrong, but you should include a comment on why this is
> a benign warning. Presumably it's some padding memory somewhere, but
> it's not obvious from the above bleat.

Sure. This looks slightly more complicated than first anticipated, but
I'll keep everyone posted.

Valgrind suppression aside, this raises another question. The stack
trace shows that the error happens during the creation of a new TOAST
table (CheckAndCreateToastTable()). I wonder if I should also pass
down a flag that makes sure that parallelism is never even attempted
from that path, to match TRUNCATE's suppression of parallel index
builds during its reindexing. It really shouldn't be a problem as
things stand, but maybe it's better to be consistent about "useless"
parallel CREATE INDEX attempts, and suppress them here too.

-- 
Peter Geoghegan


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Fri, Feb 2, 2018 at 4:31 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Fri, Feb 2, 2018 at 1:58 PM, Andres Freund <andres@anarazel.de> wrote:
>> Not saying you're wrong, but you should include a comment on why this is
>> a benign warning. Presumably it's some padding memory somewhere, but
>> it's not obvious from the above bleat.
>
> Sure. This looks slightly more complicated than first anticipated, but
> I'll keep everyone posted.

I couldn't make up my mind if it was best to prevent the uninitialized
write(), or to instead just add a suppression. I eventually decided
upon the suppression -- see attached patch. My proposed commit message
has a full explanation of the Valgrind issue, which I won't repeat
here. Go read it before reading the rest of this e-mail.

It might seem like my suppression is overly broad, or not broad
enough, since it essentially targets LogicalTapeFreeze(). I don't
think it is, though, because this can occur in two places within
LogicalTapeFreeze() -- it can occur in the place we actually saw the
issue on lousyjack, from the ltsReadBlock() call within
LogicalTapeFreeze(), as well as a second place -- when
BufFileExportShared() is called. I found that you have to tweak code
to prevent it happening in the first place before you'll see it happen
in the second place. I see no point in actually playing whack-a-mole
for a totally benign issue like this, though, which made me finally
decide upon the suppression approach.
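
For reference, the suppression has roughly this shape (illustrative
only; the attached patch has the exact frames):

{
    uninitialized_logtape_write
    Memcheck:Param
    write(buf)
    ...
    fun:LogicalTapeFreeze
}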

Bear in mind that a third way of fixing this would be to allocate
logtape.c buffers using palloc0() rather than palloc() (though I don't
like that idea at all). For serial external sorts, the logtape.c
buffers are guaranteed to have been written to/initialized at least
once as part of spilling a sort to disk. Parallel external sorts don't
quite guarantee that, which is why we run into this Valgrind issue.

-- 
Peter Geoghegan

Attachment

Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Robert Haas
Date:
On Fri, Feb 2, 2018 at 10:26 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> My proposed commit message
> has a full explanation of the Valgrind issue, which I won't repeat
> here. Go read it before reading the rest of this e-mail.

I'm going to paste the first two sentences of your proposed commit
message in here for the convenience of other readers, since I want to
reply to them.

# LogicalTapeFreeze() may write out its first block when it is dirty but
# not full, and then immediately read the first block back in from its
# BufFile as a BLCKSZ-width block.  This can only occur in rare cases
# where next to no tuples were written out, which is only possible with
# parallel external tuplesorts.

So, if I understand correctly what you're saying here, valgrind is
totally cool with us writing out an only-partially-initialized block
to a disk file, but it's got a real problem with us reading that data
back into the same memory space it already occupies.  That's a little
odd.  I presume that it's common for the tail of the final block
written to be uninitialized, but normally when we then go read block
0, that's some other, fully initialized block.

It seems like it would be pretty easy to just suppress the useless
read when we've already got the correct data, and I'd lean toward
going that direction since it's a valid optimization anyway.  But I'd
like to hear some opinions from people who use and think about
valgrind more than I do (Tom, Andres, Noah, ...?).

> It might seem like my suppression is overly broad, or not broad
> enough, since it essentially targets LogicalTapeFreeze(). I don't
> think it is, though, because the warning can occur in two places
> within LogicalTapeFreeze(): at the ltsReadBlock() call, which is where
> we actually saw the issue on lousyjack, and at the
> BufFileExportShared() call. I found that you have to tweak code to
> prevent it from happening in the first place before you'll see it
> happen in the second place.

I don't quite see how that would happen, because BufFileExportShared,
at least AFAICS, doesn't touch the buffer?

Unfortunately valgrind does not work at all on my laptop -- the server
appears to start, but as soon as you try to connect, the whole thing
dies with an error claiming that the startup process has failed.  So I
can't easily test this at the moment.  I'll try to get it working,
here or elsewhere, but thought I'd send the above reply first.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Mon, Feb 5, 2018 at 9:43 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> # LogicalTapeFreeze() may write out its first block when it is dirty but
> # not full, and then immediately read the first block back in from its
> # BufFile as a BLCKSZ-width block.  This can only occur in rare cases
> # where next to no tuples were written out, which is only possible with
> # parallel external tuplesorts.
>
> So, if I understand correctly what you're saying here, valgrind is
> totally cool with us writing out an only-partially-initialized block
> to a disk file, but it's got a real problem with us reading that data
> back into the same memory space it already occupies.

That's not quite it. Valgrind is cool with a BufFileWrite(), which
doesn't result in an actual write() because the buffile.c stdio-style
buffer (which isn't where the uninitialized bytes originate from)
isn't yet filled. The actual write() comes later, and that's the point
at which Valgrind complains. IOW, Valgrind is cool with copying around
uninitialized memory, provided we don't do anything with the
underlying values (e.g., a write(), or something that affects control
flow).
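
A tiny standalone program -- nothing to do with Postgres, purely to
illustrate the Memcheck semantics I'm describing -- shows the same
pattern:

/* Standalone illustration, not PostgreSQL code. */
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
    char   *src = malloc(8192);
    char   *dst = malloc(8192);

    memset(src, 'x', 100);      /* only the first 100 bytes defined */
    memcpy(dst, src, 8192);     /* silent: copying undefined bytes is OK */
    write(STDOUT_FILENO, dst, 8192);    /* Memcheck complains here:
                                         * "Syscall param write(buf)
                                         * points to uninitialised
                                         * byte(s)" */
    free(src);
    free(dst);
    return 0;
}

The memcpy() draws no complaint at all; only the write() does.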

> I presume that it's common for the tail of the final block
> written to be uninitialized, but normally when we then go read block
> 0, that's some other, fully initialized block.

It certainly is common. In the case of logtape.c, we almost always
write out some garbage bytes, even with serial sorts. The only
difference here is the *sense* in which they're garbage: they're
uninitialized bytes, which Valgrind cares about, rather than bytes
from previous writes that are left behind in the buffer, which
Valgrind does not care about.

>> It might seem like my suppression is overly broad, or not broad
>> enough, since it essentially targets LogicalTapeFreeze(). I don't
>> think it is, though, because the warning can occur in two places
>> within LogicalTapeFreeze(): at the ltsReadBlock() call, which is where
>> we actually saw the issue on lousyjack, and at the
>> BufFileExportShared() call. I found that you have to tweak code to
>> prevent it from happening in the first place before you'll see it
>> happen in the second place.
>
> I don't quite see how that would happen, because BufFileExportShared,
> at least AFAICS, doesn't touch the buffer?

It doesn't have to -- at least not directly. Valgrind remembers that
the uninitialized memory from logtape.c buffers is poisoned -- the
poison "spreads". The knowledge that the bytes are poisoned is tracked
as they're copied around. You get the error on the write() from the
BufFile buffer, even though you can make the error go away by using
palloc0() instead of palloc() within logtape.c, and nowhere else.

-- 
Peter Geoghegan


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Robert Haas
Date:
On Mon, Feb 5, 2018 at 1:03 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> It certainly is common. In the case of logtape.c, we almost always
> write out some garbage bytes, even with serial sorts. The only
> difference here is the *sense* in which they're garbage: they're
> uninitialized bytes, which Valgrind cares about, rather than bytes
> from previous writes that are left behind in the buffer, which
> Valgrind does not care about.

/me face-palms.

So, I guess another option might be to call VALGRIND_MAKE_MEM_DEFINED
on the buffer.  "We know what we're doing, trust us!"

In some ways, that seems better than inserting a suppression, because
it only affects the memory in the buffer.
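
For concreteness, I mean the client request from memcheck.h, wrapped
in something like this (the helper and its arguments are invented for
the example):

#include <stddef.h>
#include <valgrind/memcheck.h>

/*
 * Hypothetical helper: tell Memcheck to treat every byte of the buffer
 * as initialized, including any tail that was never written to. The
 * client request expands to a no-op when not running under Valgrind.
 */
static void
buffer_mark_defined(void *buf, size_t len)
{
    VALGRIND_MAKE_MEM_DEFINED(buf, len);
}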

Anybody else want to express an opinion here?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
"Tels"
Date:
On Mon, February 5, 2018 4:27 pm, Robert Haas wrote:
> On Mon, Feb 5, 2018 at 1:03 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>> It certainly is common. In the case of logtape.c, we almost always
>> write out some garbage bytes, even with serial sorts. The only
>> difference here is the *sense* in which they're garbage: they're
>> uninitialized bytes, which Valgrind cares about, rather than bytes
>> from previous writes that are left behind in the buffer, which
>> Valgrind does not care about.
>
> /me face-palms.
>
> So, I guess another option might be to call VALGRIND_MAKE_MEM_DEFINED
> on the buffer.  "We know what we're doing, trust us!"
>
> In some ways, that seems better than inserting a suppression, because
> it only affects the memory in the buffer.
>
> Anybody else want to express an opinion here?

Are the uninitialized bytes that are written out "whatever was in the
memory previously", or just "0x00 bytes from the allocation, not yet
overwritten by the PG code"?

Because the first sounds like it could be a security problem - if random
junk bytes go out to the disk, and stay there, information could
inadvertently leak to permanent storage.

Best regards,

Tels


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Mon, Feb 5, 2018 at 1:27 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Feb 5, 2018 at 1:03 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>> It certainly is common. In the case of logtape.c, we almost always
>> write out some garbage bytes, even with serial sorts. The only
>> difference here is the *sense* in which they're garbage: they're
>> uninitialized bytes, which Valgrind cares about, rather than bytes
>> from previous writes that are left behind in the buffer, which
>> Valgrind does not care about.

I should clarify what I meant here -- it is very common when we have
to freeze a tape, like when we do a serial external randomAccess
tuplesort, or a parallel worker's tuplesort. It shouldn't happen
otherwise. Note that there is a general pattern of dumping out the
current buffer just as the next one is needed, in order to make sure
that the linked list pointer correctly points to the
next/soon-to-be-current block. Note also that the majority of routines
declared within logtape.c can only be used on frozen tapes.
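
To illustrate the pattern (with invented names -- the real layout
lives in logtape.c): each block that goes to disk carries the block
number of its successor, which is why the current buffer can only be
flushed once the next block has been assigned:

/* Invented names; see logtape.c for the real block layout. */
typedef struct TapeBlockSketch
{
    char    payload[8192 - sizeof(long)];   /* caller data; the tail may
                                             * be unwritten in a final,
                                             * partially-filled block */
    long    nextBlock;                      /* successor block number,
                                             * or -1 at end of tape */
} TapeBlockSketch;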

I am pretty confident that I've scoped this correctly by targeting
LogicalTapeFreeze().

> So, I guess another option might be to call VALGRIND_MAKE_MEM_DEFINED
> on the buffer.  "We know what we're doing, trust us!"
>
> In some ways, that seems better than inserting a suppression, because
> it only affects the memory in the buffer.

I think that that would also work, and would be simpler, but would
also be slightly inferior to using the proposed suppression. If there
is garbage in logtape.c buffers, we still generally don't want to do
anything important on the basis of those values. We make one exception
with the suppression, which is a pretty typical kind of exception to
make -- don't worry if we write() poisoned bytes, since those are
bound to be alignment-related.

OTOH, as I've said, we are generally bound to write some kind of
logtape.c garbage, which will almost certainly not be of the
uninitialized memory variety. So, while I feel that the suppression is
better, the advantage is likely microscopic.

-- 
Peter Geoghegan


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Mon, Feb 5, 2018 at 1:39 PM, Tels <nospam-abuse@bloodgate.com> wrote:
> Are the uninitialized bytes that are written out "whatever was in the
> memory previously", or just "0x00 bytes from the allocation, not yet
> overwritten by the PG code"?
>
> Because the first sounds like it could be a security problem - if random
> junk bytes go out to the disk, and stay there, information could
> inadvertently leak to permanent storage.

But you can say the same thing about *any* of the
write()-of-uninitialized-bytes Valgrind suppressions that already
exist. There are quite a few of those.

That just isn't part of our security model.

-- 
Peter Geoghegan


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Mon, Feb 5, 2018 at 1:45 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>> So, I guess another option might be to call VALGRIND_MAKE_MEM_DEFINED
>> on the buffer.  "We know what we're doing, trust us!"
>>
>> In some ways, that seems better than inserting a suppression, because
>> it only affects the memory in the buffer.
>
> I think that that would also work, and would be simpler, but would
> also be slightly inferior to using the proposed suppression. If there
> is garbage in logtape.c buffers, we still generally don't want to do
> anything important on the basis of those values. We make one exception
> with the suppression, which is a pretty typical kind of exception to
> make -- don't worry if we write() poisoned bytes, since those are
> bound to be alignment-related.
>
> OTOH, as I've said, we are generally bound to write some kind of
> logtape.c garbage, which will almost certainly not be of the
> uninitialized memory variety. So, while I feel that the suppression is
> better, the advantage is likely microscopic.

The attached patch does it to the tail of the buffer, as Tom suggested
on the -committers thread.

Note that there is one other place in logtape.c that can write a
partial block like this: LogicalTapeRewindForRead(). I haven't
bothered to do anything there, since it cannot possibly be affected by
this issue, for the same reason that serial sorts cannot be -- that
code is only used by a tuplesort that really needs to spill to disk
and merge multiple runs (or for tapes that have already been frozen,
which are expected never to reallocate logtape.c buffers).
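
In outline, the fix amounts to something like this (field names are
approximate -- see the attached patch for the real code):

    /*
     * Mark the unused tail of the tape's buffer as defined before it
     * is written out and immediately read back; the macro is a no-op
     * when not running under Valgrind. Field names approximate.
     */
    VALGRIND_MAKE_MEM_DEFINED(lt->buffer + lt->nbytes,
                              lt->buffer_size - lt->nbytes);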

-- 
Peter Geoghegan

Attachment

Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> Unfortunately valgrind does not work at all on my laptop -- the server
> appears to start, but as soon as you try to connect, the whole thing
> dies with an error claiming that the startup process has failed.  So I
> can't easily test this at the moment.  I'll try to get it working,
> here or elsewhere, but thought I'd send the above reply first.

Do you want somebody who does have a working valgrind installation
(ie me) to take responsibility for pushing this patch?

            regards, tom lane


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Robert Haas
Date:
On Tue, Feb 6, 2018 at 2:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> Unfortunately valgrind does not work at all on my laptop -- the server
>> appears to start, but as soon as you try to connect, the whole thing
>> dies with an error claiming that the startup process has failed.  So I
>> can't easily test this at the moment.  I'll try to get it working,
>> here or elsewhere, but thought I'd send the above reply first.
>
> Do you want somebody who does have a working valgrind installation
> (ie me) to take responsibility for pushing this patch?

I committed it before seeing this.  It probably would've been better
if you had done it, but I assume Peter tested it, so let's see what
the BF thinks.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Tue, Feb 6, 2018 at 12:53 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Do you want somebody who does have a working valgrind installation
>> (ie me) to take responsibility for pushing this patch?
>
> I committed it before seeing this.  It probably would've been better
> if you had done it, but I assume Peter tested it, so let's see what
> the BF thinks.

I did test it with a full "make installcheck" + valgrind-3.11.0. I'd
be very surprised if this doesn't make the buildfarm go green.

-- 
Peter Geoghegan


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Tomas Vondra
Date:
On 02/06/2018 09:56 PM, Peter Geoghegan wrote:
> On Tue, Feb 6, 2018 at 12:53 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> Do you want somebody who does have a working valgrind installation
>>> (ie me) to take responsibility for pushing this patch?
>>
>> I committed it before seeing this.  It probably would've been better
>> if you had done it, but I assume Peter tested it, so let's see what
>> the BF thinks.
> 
> I did test it with a full "make installcheck" + valgrind-3.11.0. I'd
> be very surprised if this doesn't make the buildfarm go green.
> 

Did you do a test with "-O0"? In my experience that makes valgrind
tests much more reliable and repeatable. Some time ago we saw cases
that were failing for me but not for others, and I suspect that was
due to my using "-O0".

(This is more a random comment than a suggestion that your patch won't
make the buildfarm green.)

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Tue, Feb 6, 2018 at 1:04 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> Did you do a test with "-O0"? In my experience that makes valgrind
> tests much more reliable and repeatable. Some time ago we saw cases
> that were failing for me but not for others, and I suspect that was
> due to my using "-O0".

FWIW, I use -O1 when configure is run for Valgrind. I also turn off
assertions (this is all scripted). According to the Valgrind manual:

"With -O1 line numbers in error messages can be inaccurate, although
generally speaking running Memcheck on code compiled at -O1 works
fairly well, and the speed improvement compared to running -O0 is
quite significant. Use of -O2 and above is not recommended as Memcheck
occasionally reports uninitialised-value errors which don’t really
exist."

The manual does also say that there might even be some problems with
-O1 at a later point, but it sounds like it's probably worth it to me.
Skink uses -Og, FWIW.
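
In case it's useful, my script runs configure along roughly these
lines (flags illustrative; defining USE_VALGRIND is what enables the
Valgrind client requests in the backend):

# Flags are illustrative, not an exact copy of my script:
./configure CFLAGS="-O1 -ggdb" CPPFLAGS="-DUSE_VALGRIND" --enable-debug

Assertions stay off simply by not passing --enable-cassert.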

--
Peter Geoghegan


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Tomas Vondra
Date:

On 02/06/2018 10:14 PM, Peter Geoghegan wrote:
> On Tue, Feb 6, 2018 at 1:04 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> Did you do a test with "-O0"? In my experience that makes valgrind
>> tests much more reliable and repeatable. Some time ago we saw cases
>> that were failing for me but not for others, and I suspect that was
>> due to my using "-O0".
> 
> FWIW, I use -O1 when configure is run for Valgrind. I also turn off
> assertions (this is all scripted). According to the Valgrind manual:
> 
> "With -O1 line numbers in error messages can be inaccurate, although
> generally speaking running Memcheck on code compiled at -O1 works
> fairly well, and the speed improvement compared to running -O0 is
> quite significant. Use of -O2 and above is not recommended as Memcheck
> occasionally reports uninitialised-value errors which don’t really
> exist."
> 

OK, although I was suggesting that the optimizations may actually have
the opposite effect - valgrind missing some of the invalid memory
accesses (until the compiler decides not to use them for some reason,
causing sudden valgrind failures).

> The manual does also say that there might even be some problems with
> -O1 at a later point, but it sounds like it's probably worth it to me.
> Skink uses -Og, FWIW.
> 

I have little idea what -Og means exactly. It seems to be focused on
the debugging experience, and so still does some of the optimizations,
which I think would explain why skink was not detecting some of the
failures for a long time.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Tue, Feb 6, 2018 at 1:30 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> I have little idea what -Og exactly means. It seems to be focused on
> debugging experience, and so still does some of the optimizations.

As I understand it, -Og allows any optimization that does not hamper
walking through code with a debugger.

> Which
> I think would explain why skink was not detecting some of the failures
> for a long time.

I think that skink didn't detect failures until now because the code
wasn't exercised until parallel CREATE INDEX was added, simply because
the function LogicalTapeFreeze() was never reached (that's not the
only reason, but it is the most obvious one).

-- 
Peter Geoghegan


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Tomas Vondra
Date:

On 02/06/2018 10:39 PM, Peter Geoghegan wrote:
> On Tue, Feb 6, 2018 at 1:30 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> I have little idea what -Og exactly means. It seems to be focused on
>> debugging experience, and so still does some of the optimizations.
> 
> As I understand it, -Og allows any optimization that does not hamper
> walking through code with a debugger.
> 
>> Which
>> I think would explain why skink was not detecting some of the failures
>> for a long time.
> 
> I think that skink didn't detect failures until now because the code
> wasn't exercised until parallel CREATE INDEX was added, simply because
> the function LogicalTapeFreeze() was never reached (that's not the
> only reason, but it is the most obvious one).
> 

Maybe. What I had in mind was a different thread from November,
discussing some non-deterministic valgrind failures:


https://www.postgresql.org/message-id/flat/20171125200014.qbewtip5oydqsklt%40alap3.anarazel.de#20171125200014.qbewtip5oydqsklt@alap3.anarazel.de

But you're right that may be irrelevant here. As I said, it was mostly
just a random comment about valgrind.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Robert Haas
Date:
On Tue, Feb 6, 2018 at 3:53 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Feb 6, 2018 at 2:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Robert Haas <robertmhaas@gmail.com> writes:
>>> Unfortunately valgrind does not work at all on my laptop -- the server
>>> appears to start, but as soon as you try to connect, the whole thing
>>> dies with an error claiming that the startup process has failed.  So I
>>> can't easily test this at the moment.  I'll try to get it working,
>>> here or elsewhere, but thought I'd send the above reply first.
>>
>> Do you want somebody who does have a working valgrind installation
>> (ie me) to take responsibility for pushing this patch?
>
> I committed it before seeing this.  It probably would've been better
> if you had done it, but I assume Peter tested it, so let's see what
> the BF thinks.

skink and lousyjack seem happy now, so I think it worked.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Prabhat Sahu
Date:
Hi all,

While testing this feature I found a crash on PG head with parallel create index using pgbench tables.

-- GUCs under postgresql.conf
max_parallel_maintenance_workers = 16
max_parallel_workers = 16
max_parallel_workers_per_gather = 8
maintenance_work_mem = 8GB
max_wal_size = 4GB

./pgbench -i -s 500 -d postgres

postgres=# create index pgb_acc_idx3 on pgbench_accounts(aid, abalance, filler);
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and repeat your command.
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!> 


--

With Regards,

Prabhat Kumar Sahu
Skype ID: prabhat.sahu1984
EnterpriseDB Corporation

The Postgres Database Company




Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Robert Haas
Date:
On Wed, Mar 7, 2018 at 8:13 AM, Prabhat Sahu <prabhat.sahu@enterprisedb.com> wrote:
> Hi all,
>
> While testing this feature I found a crash on PG head with parallel
> create index using pgbench tables.
>
> -- GUCs under postgresql.conf
> max_parallel_maintenance_workers = 16
> max_parallel_workers = 16
> max_parallel_workers_per_gather = 8
> maintenance_work_mem = 8GB
> max_wal_size = 4GB
>
> ./pgbench -i -s 500 -d postgres
>
> postgres=# create index pgb_acc_idx3 on pgbench_accounts(aid, abalance, filler);
> WARNING:  terminating connection because of crash of another server process
> DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
> HINT:  In a moment you should be able to reconnect to the database and repeat your command.
> server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.
> The connection to the server was lost. Attempting reset: Failed.
> !>

That makes it look like perhaps one of the worker backends crashed.  Did you get a message in the logfile that might indicate the nature of the crash?  Something with PANIC or TRAP, perhaps?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Prabhat Sahu
Date:

On Wed, Mar 7, 2018 at 7:16 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Mar 7, 2018 at 8:13 AM, Prabhat Sahu <prabhat.sahu@enterprisedb.com> wrote:
>> Hi all,
>>
>> While testing this feature I found a crash on PG head with parallel
>> create index using pgbench tables.
>>
>> -- GUCs under postgresql.conf
>> max_parallel_maintenance_workers = 16
>> max_parallel_workers = 16
>> max_parallel_workers_per_gather = 8
>> maintenance_work_mem = 8GB
>> max_wal_size = 4GB
>>
>> ./pgbench -i -s 500 -d postgres
>>
>> postgres=# create index pgb_acc_idx3 on pgbench_accounts(aid, abalance, filler);
>> WARNING:  terminating connection because of crash of another server process
>> DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
>> HINT:  In a moment you should be able to reconnect to the database and repeat your command.
>> server closed the connection unexpectedly
>> This probably means the server terminated abnormally
>> before or while processing the request.
>> The connection to the server was lost. Attempting reset: Failed.
>> !>
>
> That makes it look like perhaps one of the worker backends crashed.  Did you get a message in the logfile that might indicate the nature of the crash?  Something with PANIC or TRAP, perhaps?

I am not able to see any PANIC/TRAP in the log file; here are its
contents.

[edb@localhost bin]$ cat logsnew 
2018-03-07 19:21:20.922 IST [54400] LOG:  listening on IPv6 address "::1", port 5432
2018-03-07 19:21:20.922 IST [54400] LOG:  listening on IPv4 address "127.0.0.1", port 5432
2018-03-07 19:21:20.925 IST [54400] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5432"
2018-03-07 19:21:20.936 IST [54401] LOG:  database system was shut down at 2018-03-07 19:21:20 IST
2018-03-07 19:21:20.939 IST [54400] LOG:  database system is ready to accept connections
2018-03-07 19:24:44.263 IST [54400] LOG:  background worker "parallel worker" (PID 54482) was terminated by signal 9: Killed
2018-03-07 19:24:44.286 IST [54400] LOG:  terminating any other active server processes
2018-03-07 19:24:44.297 IST [54405] WARNING:  terminating connection because of crash of another server process
2018-03-07 19:24:44.297 IST [54405] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2018-03-07 19:24:44.297 IST [54405] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2018-03-07 19:24:44.301 IST [54478] WARNING:  terminating connection because of crash of another server process
2018-03-07 19:24:44.301 IST [54478] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2018-03-07 19:24:44.301 IST [54478] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2018-03-07 19:24:44.494 IST [54504] FATAL:  the database system is in recovery mode
2018-03-07 19:24:44.496 IST [54400] LOG:  all server processes terminated; reinitializing
2018-03-07 19:24:44.513 IST [54505] LOG:  database system was interrupted; last known up at 2018-03-07 19:22:54 IST
2018-03-07 19:24:44.552 IST [54505] LOG:  database system was not properly shut down; automatic recovery in progress
2018-03-07 19:24:44.554 IST [54505] LOG:  redo starts at 0/AB401A38
2018-03-07 19:25:14.712 IST [54505] LOG:  invalid record length at 1/818B8D80: wanted 24, got 0
2018-03-07 19:25:14.714 IST [54505] LOG:  redo done at 1/818B8D48
2018-03-07 19:25:14.714 IST [54505] LOG:  last completed transaction was at log time 2018-03-07 19:24:05.322402+05:30
2018-03-07 19:25:16.887 IST [54400] LOG:  database system is ready to accept connections

 

--

With Regards,

Prabhat Kumar Sahu
Skype ID: prabhat.sahu1984
EnterpriseDB Corporation

The Postgres Database Company

Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Robert Haas
Date:
On Wed, Mar 7, 2018 at 8:59 AM, Prabhat Sahu <prabhat.sahu@enterprisedb.com> wrote:
> 2018-03-07 19:24:44.263 IST [54400] LOG:  background worker "parallel worker" (PID 54482) was terminated by signal 9: Killed

That looks like the background worker got killed by the OOM killer.  How much memory do you have in the machine where this occurred?
 
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Tomas Vondra
Date:
On 03/07/2018 03:21 PM, Robert Haas wrote:
> On Wed, Mar 7, 2018 at 8:59 AM, Prabhat Sahu
> <prabhat.sahu@enterprisedb.com>
> wrote:
> 
>     2018-03-07 19:24:44.263 IST [54400] LOG:  background worker
>     "parallel worker" (PID 54482) was terminated by signal 9: Killed
> 
> 
> That looks like the background worker got killed by the OOM killer.  How
> much memory do you have in the machine where this occurred?
>  

FWIW that's usually written to the system log. Does dmesg say something
about the kill?
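
Something along these lines usually shows it (the exact message format
varies by kernel version):

# Look for OOM-killer activity in the kernel log:
dmesg | grep -i -E 'out of memory|oom|killed process'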

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Peter Geoghegan
Date:
On Wed, Mar 7, 2018 at 5:16 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> FWIW that's usually written to the system log. Does dmesg say something
> about the kill?

While it would be nice to confirm that it was indeed the OOM killer,
either way the crash happened because SIGKILL was sent to a parallel
worker. There is no reason to suspect a bug.

-- 
Peter Geoghegan


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Andres Freund
Date:

On March 7, 2018 5:40:18 PM PST, Peter Geoghegan <pg@bowt.ie> wrote:
>On Wed, Mar 7, 2018 at 5:16 PM, Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>> FWIW that's usually written to the system log. Does dmesg say
>something
>> about the kill?
>
>While it would be nice to confirm that it was indeed the OOM killer,
>either way the crash happened because SIGKILL was sent to a parallel
>worker. There is no reason to suspect a bug.

It's not impossible that there's a leak somewhere, though.

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Prabhat Sahu
Date:

On Wed, Mar 7, 2018 at 7:51 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Mar 7, 2018 at 8:59 AM, Prabhat Sahu <prabhat.sahu@enterprisedb.com> wrote:
>> 2018-03-07 19:24:44.263 IST [54400] LOG:  background worker "parallel worker" (PID 54482) was terminated by signal 9: Killed
>
> That looks like the background worker got killed by the OOM killer.  How much memory do you have in the machine where this occurred?

I ran the test case on my local machine with the following
configuration:

Environment: CentOS 7(64bit)
HD : 100GB
RAM: 4GB
Processor: 4

I have narrowed down the test case as below, which also reproduces the
same crash.

-- GUCs under postgresql.conf
maintenance_work_mem = 8GB

./pgbench -i -s 500 -d postgres

postgres=# create index pgb_acc_idx3 on pgbench_accounts(aid, abalance, filler);
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and repeat your command.
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!> 

--

With Regards,

Prabhat Kumar Sahu
Skype ID: prabhat.sahu1984
EnterpriseDB Corporation

The Postgres Database Company






Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Tom Lane
Date:
Prabhat Sahu <prabhat.sahu@enterprisedb.com> writes:
> On Wed, Mar 7, 2018 at 7:51 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> That looks like the background worker got killed by the OOM killer.  How
>> much memory do you have in the machine where this occurred?

> I ran the test case on my local machine with the following configuration:
> Environment: CentOS 7(64bit)
> HD : 100GB
> RAM: 4GB
> Processor: 4

If you only have 4GB of physical RAM, it hardly seems surprising that
trying to use 8GB of maintenance_work_mem would draw the wrath of the
OOM killer.
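
For instance, something in this ballpark would be a saner starting
point on a 4GB machine (the value is illustrative, not a tuned
recommendation):

-- Illustrative value for a 4GB box, not a tuned recommendation:
SET maintenance_work_mem = '512MB';
CREATE INDEX pgb_acc_idx3 ON pgbench_accounts (aid, abalance, filler);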

            regards, tom lane


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From
Robert Haas
Date:
On Thu, Mar 8, 2018 at 11:45 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Prabhat Sahu <prabhat.sahu@enterprisedb.com> writes:
>> On Wed, Mar 7, 2018 at 7:51 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> That looks like the background worker got killed by the OOM killer.  How
>>> much memory do you have in the machine where this occurred?
>
>> I ran the test case on my local machine with the following configuration:
>> Environment: CentOS 7(64bit)
>> HD : 100GB
>> RAM: 4GB
>> Processor: 4
>
> If you only have 4GB of physical RAM, it hardly seems surprising that
> trying to use 8GB of maintenance_work_mem would draw the wrath of the
> OOM killer.

Yup.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company