Thread: COPY threads


From
Rob Sargent
Date:
Can anyone here tell me whether or not the CopyManager facility in JDBC 
via org.postgresql:postgresql:42.1.4 is internally multithreaded? 
Running on CentOS 7 (all participants), java8, postgres 10.5




Re: COPY threads

From
Laurenz Albe
Date:
Rob Sargent wrote:
> Can anyone here tell me whether or not the CopyManager facility in JDBC 
> via org.postgresql:postgresql:42.1.4 is internally multithreaded? 
> Running on CentOS 7 (all participants), java8, postgres 10.5

It isn't, and there would not be much reason for it to be, as COPY
in PostgreSQL cannot be parallelized.

Yours,
Laurenz Albe
-- 
Cybertec | https://www.cybertec-postgresql.com



Re: COPY threads

From
Rob Sargent
Date:


On Oct 10, 2018, at 10:50 AM, Laurenz Albe <laurenz.albe@cybertec.at> wrote:

Rob Sargent wrote:
Can anyone here tell me whether or not the CopyManager facility in JDBC
via org.postgresql:postgresql:42.1.4 is internally multithreaded?
Running on CentOS 7 (all participants), java8, postgres 10.5

It isn't, and there would not be much reason for it to be, as COPY
in PostgreSQL cannot be parallelized.
Thank you for that confirmation.  I had all cores firing on my CopyManager box, and it turned out the JVM was really just looking hard for more bytes to free up.  I was pulling a lot more out of the database than I could chew.  OOM ensued.

Cheers,
rjs

Re: COPY threads

From
Andres Freund
Date:
On 2018-10-10 18:50:02 +0200, Laurenz Albe wrote:
> Rob Sargent wrote:
> > Can anyone here tell me whether or not the CopyManager facility in JDBC 
> > via org.postgresql:postgresql:42.1.4 is internally multithreaded? 
> > Running on CentOS 7 (all participants), java8, postgres 10.5
> 
> It isn't, and there would not be much reason for it to be, as COPY
> in PostgreSQL cannot be parallelized.

s/cannot/is not/.

You obviously can just copy the data into postgres over multiple
connections if you need to speed COPY up. But that requires splitting up
the data on the clientside.
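A minimal sketch of the client-side splitting Andres describes (pure Python, no database involved; each resulting chunk would then be fed to its own connection's COPY FROM STDIN):

```python
def split_rows(rows, n_chunks):
    """Distribute an iterable of rows round-robin into n_chunks lists.

    Each chunk can then be streamed to a separate COPY session.
    """
    chunks = [[] for _ in range(n_chunks)]
    for i, row in enumerate(rows):
        chunks[i % n_chunks].append(row)
    return chunks

# Example: 10 rows split across 3 hypothetical COPY sessions.
chunks = split_rows((f"row{i}" for i in range(10)), 3)
```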

Greetings,

Andres Freund


Re: COPY threads

From
Rob Sargent
Date:

> On Oct 10, 2018, at 1:24 PM, Andres Freund <andres@anarazel.de> wrote:
>
> On 2018-10-10 18:50:02 +0200, Laurenz Albe wrote:
>> Rob Sargent wrote:
>>> Can anyone here tell me whether or not the CopyManager facility in JDBC
>>> via org.postgresql:postgresql:42.1.4 is internally multithreaded?
>>> Running on CentOS 7 (all participants), java8, postgres 10.5
>>
>> It isn't, and there would not be much reason for it to be, as COPY
>> in PostgreSQL cannot be parallelized.
>
> s/cannot/is not/.
>
> You obviously can just copy the data into postgres over multiple
> connections if you need to speed COPY up. But that requires splitting up
> the data on the clientside.
>
> Greetings,
>
> Andres Freund
Interesting note, deeply pocketed.  The COPY portion is /not/ my problem - that’s plenty fast and memory
efficient enough straight from the box.  I had overlooked that I was requesting 33M records in a map with a longish
CSV string as key... then playing sort games with keys and ... well, that’s as stupid as I got.



Re: COPY threads

From
Ravi Krishna
Date:
>
> You obviously can just copy the data into postgres over multiple
> connections if you need to speed COPY up. But that requires splitting up
> the data on the clientside.
>

You obviously are referring to multiple connections running COPY on different tables, right?  Like what pg_restore does
with the -j option.
Doesn't COPY take an exclusive lock on the table, which makes it incompatible with parallelization?



Re: COPY threads

From
Christopher Browne
Date:
On Wed, 10 Oct 2018 at 16:22, Ravi Krishna <srkrishna1@aol.com> wrote:
> You obviously are referring to multiple connections running COPY on different tables, right?  Like what pg_restore
> does with the -j option.
>
> Doesn't COPY take an exclusive lock on the table, which makes it incompatible with parallelization?

No, why would that seem to be the case?  If it did so, then you could
not run pg_dump to dump data while regular activity was going on.

That's decidedly not the case.

The challenge in parallelizing a dump via COPY TO is in ensuring that
the multiple requests are attached to the same serializable
transaction.  There's a function now that allows multiple connections
to attach to the same transaction context, I believe...  Also, there's
the challenge in actually splitting the data, so that both requests
are dumping different data; that might be fairly expensive whether
with or without indices.
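The function Christopher is alluding to is presumably pg_export_snapshot(), paired with SET TRANSACTION SNAPSHOT on the other connections (this is how pg_dump -j synchronizes its workers). A rough sketch, with the snapshot ID and table name invented for illustration:

```sql
-- Session 1: export the snapshot this transaction sees
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT pg_export_snapshot();  -- returns an ID such as '00000003-0000001B-1'

-- Session 2 (and others): adopt that same snapshot before dumping
BEGIN ISOLATION LEVEL REPEATABLE READ;
SET TRANSACTION SNAPSHOT '00000003-0000001B-1';
COPY (SELECT * FROM big_table WHERE id % 2 = 0) TO STDOUT;
```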

-- 
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"


Re: COPY threads

From
Ravi Krishna
Date:
>
> No, why would that seem to be the case?  If it did so, then you could
> not run pg_dump to dump data while regular activity was going on.


Not sure. In fact I am now confused.
I am talking about pg_restore, which does COPY FROM and takes an exclusive lock on the table
while the data is loaded into the table.
pg_dump does COPY TO, which reads data out of the table to a file or something else.

Re: COPY threads

From
Andres Freund
Date:
Hi,

On 2018-10-10 17:02:59 -0400, Ravi Krishna wrote:
> I am talking about pg_restore which is COPY FROM, which takes exclusive lock on the table
> while the data is loaded into the table.

pg_restore doesn't take an exclusive lock on the table for the COPY itself; the
exclusive lock you see is because creating the table takes one.

Greetings,

Andres Freund


Re: COPY threads

From
Ravi Krishna
Date:
>
> pg_restore doesn't take locks on the table for the COPY, it does so
> because creating the table takes an exclusive lock.


Interesting.  I seem to recollect reading here that I can't have concurrent COPY on the same table because of the lock.
To give an example:

If I have a large file with say 400 million rows, can I first split it into 10 files of 40 million rows each and then
fire up 10 different
COPY sessions, each reading from a split file, but copying into the same table?  I thought not.  It will be great if
we can do this.

Re: COPY threads

From
Andres Freund
Date:

On October 10, 2018 2:15:19 PM PDT, Ravi Krishna <srkrishna1@aol.com> wrote:
>>
>> pg_restore doesn't take locks on the table for the COPY, it does so
>> because creating the table takes an exclusive lock.
>
>
>Interesting.  I seem to recollect reading here that I can't have
>concurrent COPY on the same table because of the lock.
>To give an example:
>
>If I have a large file with say 400 million rows, can I first split it
>into 10 files of 40 million rows each and then fire up 10 different
>COPY sessions , each reading from a split file, but copying into the
>same table.  I thought not.  It will be great if we can do this.

Yes, you can.
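A sketch of driving the split files over parallel sessions, as confirmed above. The loader below just counts lines so the sketch runs without a server; in real use it would open its own connection and stream the file (e.g. with psycopg2's copy_expert). File paths and the worker count are made up for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

def copy_chunk(path):
    # Stand-in for a real COPY, e.g. with psycopg2:
    #   cur.copy_expert("COPY test.table1 FROM STDIN WITH CSV", open(path))
    # Here we only count lines to keep the sketch runnable without a database.
    with open(path) as f:
        return sum(1 for _ in f)

def parallel_copy(paths, workers=10):
    # One worker per split file, all targeting the same table.
    with ThreadPoolExecutor(max_workers=workers) as ex:
        return sum(ex.map(copy_chunk, paths))
```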

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.


Re: COPY threads

From
Ravi Krishna
Date:
Thank you.  Let me test it and see the benefit. We have a use case for this.


> On Oct 10, 2018, at 17:18 , Andres Freund <andres@anarazel.de> wrote:
>
>
>
> On October 10, 2018 2:15:19 PM PDT, Ravi Krishna <srkrishna1@aol.com> wrote:
>>>
>>> pg_restore doesn't take locks on the table for the COPY, it does so
>>> because creating the table takes an exclusive lock.
>>
>>
>> Interesting.  I seem to recollect reading here that I can't have
>> concurrent COPY on the same table because of the lock.
>> To give an example:
>>
>> If I have a large file with say 400 million rows, can I first split it
>> into 10 files of 40 million rows each and then fire up 10 different
>> COPY sessions , each reading from a split file, but copying into the
>> same table.  I thought not.  It will be great if we can do this.
>
> Yes, you can.
>
> Andres
> --
> Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: COPY threads

From
"Peter J. Holzer"
Date:
On 2018-10-10 17:19:50 -0400, Ravi Krishna wrote:
> > On Oct 10, 2018, at 17:18 , Andres Freund <andres@anarazel.de> wrote:
> > On October 10, 2018 2:15:19 PM PDT, Ravi Krishna <srkrishna1@aol.com> wrote:
> >> If I have a large file with say 400 million rows, can I first split it
> >> into 10 files of 40 million rows each and then fire up 10 different
> >> COPY sessions , each reading from a split file, but copying into the
> >> same table.  I thought not.  It will be great if we can do this.
> >
> > Yes, you can.
> >
> Thank you.  Let me test it and see the benefit. We have a use case for this.

You should of course test this on your own hardware with your own data,
but here are the results of a simple benchmark (import 1 million rows
into a table without indexes via different methods) I ran a few weeks
ago on one of our servers:

https://github.com/hjp/dbbench/blob/master/import_pg_comparison/results/claudrin.2018-09-22/results.png

y axis is rows per second. x axis are different runs, sorted from
slowest to fastest (so 2 is the median).

As you can see, it doesn't parallelize perfectly: 2 COPY processes are
only about 50% faster than 1, and 4 are about 33% faster than 2. But
there is still quite a respectable performance boost.
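Those scaling figures can be restated as parallel efficiency. The absolute rows/sec below are invented; only the ratios match the description above:

```python
# Hypothetical throughputs: 2 sessions ~50% faster than 1, 4 ~33% faster than 2.
rps = {1: 100_000, 2: 150_000, 4: 200_000}  # rows per second (invented)

speedup_2_vs_1 = rps[2] / rps[1] - 1   # fractional gain of 2 sessions over 1
speedup_4_vs_2 = rps[4] / rps[2] - 1   # fractional gain of 4 sessions over 2
efficiency_4 = rps[4] / (4 * rps[1])   # per-process efficiency at 4 sessions
```

So at 4 sessions each process runs at only half the single-session rate, yet total throughput still doubles.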

        hp

PS: The script is of course in the same repo, but I didn't include the
test data because I don't think I'm allowed to include that.

--
   _  | Peter J. Holzer    | we build much bigger, better disasters now
|_|_) |                    | because we have much more sophisticated
| |   | hjp@hjp.at         | management tools.
__/   | http://www.hjp.at/ | -- Ross Anderson <https://www.edge.org/>


Re: COPY threads

From
Thomas Kellerer
Date:
Rob Sargent wrote on 10.10.2018 at 00:45:
> Can anyone here tell me whether or not the CopyManager facility in
> JDBC via org.postgresql:postgresql:42.1.4 is internally
> multithreaded? Running on CentOS 7 (all participants), java8,
> postgres 10.5

An alternative to creating your own multi-threaded importer might be to use pgLoader, which supports that out of the
box:

https://pgloader.readthedocs.io/en/latest/pgloader.html#a-note-about-parallelism

I have never used it myself, but I have heard good things about it.


Re: COPY threads

From
Ravi Krishna
Date:
>>>
>> Thank you.  Let me test it and see the benefit. We have a use case for this.
>

Well the result is not what I expected.

this is the sql I used

copy TEST.TABLE1 from '/var/lib/postgresql/data/rkrishna/copytesting/xaa' with delimiter '|' NULL as '' CSV HEADER;

From another session

copy TEST.TABLE1 from '/var/lib/postgresql/data/rkrishna/copytesting/xab' with delimiter '|' NULL as '' CSV HEADER;

Each had 16M rows.

I see that one COPY is blocking the other.

The table has no indexes while loading.

Aren't they supposed to run concurrently without locking?



Re: COPY threads

From
Ravi Krishna
Date:
>
> Well the result is not what I expected.
>
> this is the sql I used
>
> copy TEST.TABLE1 from '/var/lib/postgresql/data/rkrishna/copytesting/xaa' with delimiter '|' NULL as '' CSV HEADER;
>
> From another session
>
> copy TEST.TABLE1 from '/var/lib/postgresql/data/rkrishna/copytesting/xab' with delimiter '|' NULL as '' CSV HEADER;
>
> Each had 16M rows.
>
> I see that one COPY is blocking the other.
>
> The table has no indexes while loading.
>
> Aren't they supposed to run concurrently without locking?
>
>

Strike that.  The lock wait is bogus.  The data did get loaded concurrently.