Thread: PARALLEL CTAS

PARALLEL CTAS

From: Shane Borden
Based on testing thus far, it appears that the only way to get a bulk load operation (similar to Oracle's) is to use the COPY command, or to open multiple connections and break the inserts up across them.  I did see a possibility of using the postgres_fdw extension with its batch functionality, but I haven't tested that yet.
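For what it's worth, the "multiple connections" approach mostly comes down to carving the source into disjoint key ranges, one per connection. A minimal sketch of that carving in Python (the table and column names in the comment are hypothetical, and this assumes a dense integer key):

```python
def split_key_range(lo, hi, workers):
    """Split the inclusive key range [lo, hi] into contiguous chunks,
    one per worker connection."""
    total = hi - lo + 1
    base, extra = divmod(total, workers)
    chunks, start = [], lo
    for i in range(workers):
        size = base + (1 if i < extra else 0)
        if size:
            chunks.append((start, start + size - 1))
            start += size
    return chunks

# Each worker then runs, on its own connection (hypothetical names):
#   INSERT INTO target SELECT * FROM source WHERE id BETWEEN %s AND %s
print(split_key_range(1, 1_000_000, 4))
# → [(1, 250000), (250001, 500000), (500001, 750000), (750001, 1000000)]
```

Sparse or non-integer keys would need a different split (e.g. on ctid or percentile boundaries), so treat this purely as the shape of the idea.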

Does anyone know the status of this feature / patch that was discussed in this link?


Is there a way in PostgreSQL to parallelize a CTAS (CREATE TABLE AS SELECT) statement, similar to how you can in Oracle?

---

Thanks,


Shane Borden
sborden76@gmail.com

Re: PARALLEL CTAS

From: Erik Brandsberg
COPY operations are by far the most efficient way to do bulk loads.  The postgres_fdw extension isn't very efficient, based on prior experience.

On Mon, Dec 12, 2022 at 11:17 AM Shane Borden <sborden76@gmail.com> wrote:

Re: PARALLEL CTAS

From: Shane Borden
The issue is that there are certain performance benefits to be had from parallel CTAS operations, and when converting from Oracle to PostgreSQL, switching to a COPY operation isn't feasible.
---

Thanks,


Shane Borden
sborden76@gmail.com

On Dec 12, 2022, at 11:55 AM, Erik Brandsberg <erik@heimdalldata.com> wrote:



Re: PARALLEL CTAS

From: Rob Sargent
On 12/12/22 10:13, Shane Borden wrote:
Today I suspect you're left with something like the following:
- CTAS from source WHERE 1=2 (i.e. table definition via SELECT semantics)
- COPY FROM STDIN (fed with the output of the intended CTAS SELECT)
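If it helps, those two steps can be driven from a client with no intermediate file by piping one psql's COPY TO STDOUT into another's COPY FROM STDIN. A rough Python sketch that only builds the commands (table names are placeholders; untested against a real server):

```python
def ctas_via_copy_cmds(target, source):
    """Build the two steps: clone the source's column definitions without
    copying any rows, then stream the rows over with COPY."""
    ddl = f"CREATE TABLE {target} AS SELECT * FROM {source} WHERE 1=2"
    # psql's \copy runs COPY on the server but moves data through the client,
    # so one psql's TO STDOUT can feed another's FROM STDIN directly:
    pipe = (f'psql -c "\\copy (SELECT * FROM {source}) TO STDOUT" '
            f'| psql -c "\\copy {target} FROM STDIN"')
    return ddl, pipe

ddl, pipe = ctas_via_copy_cmds("t_new", "t_src")
```

The pipe avoids spooling the source to disk entirely; data flows row by row between the two sessions.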




Re: PARALLEL CTAS

From: Shane Borden
I saw this as an option.  Let's say you have a million-row table.  If you COPY from STDIN, do you have to spool the source table to a local file first?  I'm looking to replace a CTAS that's doing all of this within SQL statements now.
---

Thanks,


Shane Borden
sborden76@gmail.com

On Dec 12, 2022, at 12:25 PM, Rob Sargent <robjsargent@gmail.com> wrote:





Re: PARALLEL CTAS

From: Steve Midgley


On Mon, Dec 12, 2022 at 9:36 AM Shane Borden <sborden76@gmail.com> wrote:

I'm not at all familiar with Oracle CTAS (fair warning), but years ago I spent some time building a fast import using COPY from STDIN in Postgres.  I recently copied that code into a Ruby gist demonstrating the core concept.

The key idea, IMO, is to use an intermediate language like Ruby to manage the "spooling" from the source file to STDIN.  That system could presumably fork processes to run more than one COPY against the file at a time, coordinating each worker to read different rows of the source file.

However, my experience was that if you want the absolute fastest import, you should copy the source file to a fast disk/array on the Postgres server itself and then run the COPY command pointing at that file locally, not piping from STDIN over a network connection.  Postgres, at least in my testing, maxes out I/O on a single COPY command pretty heartily - so much so that performance elsewhere may suffer, especially if your databases are on that same disk/array.  IIRC, I was seeing millions of rows loaded per second on a high-performing Postgres server using this method.  Also IIRC, I removed the indexes from the table, again to maximize throughput, and added them back after the COPY completed (not sure whether modern Postgres COPY still needs that - this was years ago).

I think total time to completion was also faster, since rebuilding the indexes afterward beat maintaining them during the load (again, I'm foggy, so this may be inaccurate - it was years ago).  But I was loading into a blank table; rebuilding indexes on an already-populated table might have an overwhelming cost.

If you have to do a network-based COPY from STDIN, then my gist above will hopefully show the way - and I wouldn't be surprised if running several of these simultaneously beats a single COPY.  At that point you're just fine-tuning toward saturating the network bandwidth, not the max capability of the Postgres server, which the local-disk COPY achieves.
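A rough sketch of the drop-indexes / server-side COPY / rebuild sequence described above, just building the statement list (all names hypothetical; note that server-side COPY FROM a file requires superuser or the pg_read_server_files role):

```python
def fast_load_plan(table, server_side_file, indexes):
    """Statement sequence for a fast local-file load: drop secondary
    indexes, COPY from a file local to the server, then rebuild the
    indexes once the rows are in.  `indexes` is a list of
    (index_name, create_index_ddl) pairs."""
    plan = [f"DROP INDEX IF EXISTS {name}" for name, _ in indexes]
    # Server-side COPY: the backend reads the file itself, so it should
    # sit on fast local storage close to the data directory.
    plan.append(f"COPY {table} FROM '{server_side_file}'")
    plan.extend(ddl for _, ddl in indexes)  # rebuild after the load
    return plan

plan = fast_load_plan("big", "/fast/array/big.csv",
                      [("big_x_idx", "CREATE INDEX big_x_idx ON big (x)")])
```

This is only the orchestration shape; a real script would run each statement over a driver connection and wrap the drop/rebuild appropriately for your uptime requirements.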

Hopefully this input is relevant - please ignore if I'm missing your goal entirely!

Best,
Steve

Re: PARALLEL CTAS

From: Tom Lane
>> Today I suspect you're left with something like the following:
>> - CTAS from source where 1=2 (i.e. table definition via select semantics)
>> - copy from stdin (filled with intended CTAS select)

As far as I can tell, all supported versions of Postgres are perfectly
content to parallelize the source-row computation in a CREATE TABLE AS
SELECT, if they would parallelize the underlying SELECT.  Note that
this is not the same as INSERT INTO ... SELECT, which is a harder
problem because the target table might already have indexes, constraints,
etc.

If what you are concerned about is parallelization of the physical
insertions of the tuples, I'm pretty sure we don't have anything that
can do that today, including COPY.  COPY does have some batch-insertion
optimizations, but that's not parallelism.

Can you parallelize your problem at a higher level, ie do several
table loads at once?
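That higher-level parallelism might look something like this sketch: N independent loads driven concurrently, each of which would in practice open its own connection and run its COPY or INSERT ... SELECT (the load_table body here is a stub, and the table names are made up):

```python
from concurrent.futures import ThreadPoolExecutor

def load_table(name):
    # Stub: a real version would open its own connection and run the
    # COPY or INSERT ... SELECT for this one table.
    return f"{name}: loaded"

tables = ["orders", "order_items", "events"]
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(load_table, tables))  # one load per thread
```

Since each job holds its own server connection, the server-side inserts genuinely run in parallel even though the physical insertion within any single statement does not.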

            regards, tom lane