Re: a funnel by any other name - Mailing list pgsql-hackers

From Simon Riggs
Subject Re: a funnel by any other name
Msg-id CANP8+jK6SLnND6tGwNdpkw=h_SyoCt8Nd5521AOyA50M9NrNsg@mail.gmail.com
In response to Re: a funnel by any other name  (Nicolas Barbier <nicolas.barbier@gmail.com>)
Responses Re: a funnel by any other name
List pgsql-hackers
On 17 September 2015 at 05:07, Nicolas Barbier <nicolas.barbier@gmail.com> wrote:
2015-09-17 Robert Haas <robertmhaas@gmail.com>:

> 1. Exchange Bushy
> 2. Exchange Inter-Operator (this is what's currently implemented)
> 3. Exchange Replicate
> 4. Exchange Merge
> 5. Interchange

> 1. ?
> 2. Gather
> 3. Broadcast (sorta)
> 4. Gather Merge
> 5. Redistribute

> 1. Parallel Child
> 2. Parallel Gather
> 3. Parallel Replicate
> 4. Parallel Merge
> 5. Parallel Redistribute

FYI, SQL Server has these in its execution plans:

* Distribute Streams: read from one thread, write to multiple threads
* Repartition Streams: both read and write from/to multiple threads
* Gather Streams: read from multiple threads, write to one thread

Robert, thanks for asking. We'll be stuck with these words for some time, and they are user-visible via EXPLAIN, so this is important.

In general we should stick to words already used in similar situations elsewhere, which could include other DBMSs and parallel ETL tools; there are many more of those than have been mentioned here.

I would be against using any of these words: Funnel, Motion, Bushy because I don't find them very descriptive (I think of spiders, bowels and shrubs respectively, sorry).

These words are liable to be confused with other concepts: Replicate, Duplicate, Distribute, Partition, Repartition, MERGE.

I've seen this concept called Fan-In/Fan-Out and Scatter/Gather.

The main operations are the 3 mentioned by Nicolas:
1. Send data from many to one - which has subtypes for Unsorted, Sorted and Evenly balanced (but unsorted)
2. Send data from one process to many
3. Send data from many to many
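
As a rough illustration of these three operations, here is a minimal Python sketch that models each worker's output as a plain list. The function names (gather, gather_sorted, scatter, redistribute) are hypothetical and chosen only to mirror the terms under discussion; this is not how the executor implements anything, and the round-robin and hash-modulo routing rules are assumptions made for the example.

```python
import heapq
from itertools import chain


def gather(streams):
    """Many-to-one, unsorted: concatenate the worker outputs as they come."""
    return list(chain.from_iterable(streams))


def gather_sorted(streams):
    """Many-to-one, sorted: merge worker outputs that are each already sorted."""
    return list(heapq.merge(*streams))


def scatter(rows, n_workers):
    """One-to-many: deal rows out round-robin (an assumed routing rule)."""
    out = [[] for _ in range(n_workers)]
    for i, row in enumerate(rows):
        out[i % n_workers].append(row)
    return out


def redistribute(streams, n_workers, key=lambda row: row):
    """Many-to-many: re-route every row by a hash of its key (assumed rule)."""
    out = [[] for _ in range(n_workers)]
    for stream in streams:
        for row in stream:
            out[hash(key(row)) % n_workers].append(row)
    return out
```

For example, gather_sorted([[1, 4], [2, 3]]) yields [1, 2, 3, 4], while gather on the same input just concatenates the two lists. The "evenly balanced" many-to-one subtype is not modelled here.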

My preferences for this would be:
1. Gather (but not Gather Motion) e.g. Gather, Gather Sorted
2. Scatter (Broadcast only makes sense in the context of a distributed query; it sounds weird for an intra-node query)
3. Redistribute - which implies that the description of how we spread data across nodes is "Distribution" (or DISTRIBUTED BY)

For 3 we should definitely use Redistribute, since this is what Teradata has been calling it for 30 years, which is where Greenplum got it from.
For 1, Gather makes most sense.

For 2, it could be either Scatter or Distribute. The former works well with Gather, the latter works well with Redistribute.

Sorry for my absence from further review of the parallel ops work.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
