Thread: COPY threads
Can anyone here tell me whether or not the CopyManager facility in JDBC via org.postgresql:postgresql:42.1.4 is internally multithreaded? Running on CentOS 7 (all participants), java8, postgres 10.5
Rob Sargent wrote:
> Can anyone here tell me whether or not the CopyManager facility in JDBC
> via org.postgresql:postgresql:42.1.4 is internally multithreaded?
> Running on CentOS 7 (all participants), java8, postgres 10.5

It isn't, and there would not be much reason for it to be, as COPY
in PostgreSQL cannot be parallelized.

Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com
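For context, a minimal sketch of how CopyManager is typically driven from a
single connection and a single thread. The connection URL, credentials, table
name and file path below are made-up placeholders, not anything from the thread:

    import java.io.FileReader;
    import java.io.Reader;
    import java.sql.Connection;
    import java.sql.DriverManager;

    import org.postgresql.PGConnection;
    import org.postgresql.copy.CopyManager;

    public class SingleCopy {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://localhost/mydb", "user", "secret")) {
                // The COPY runs on this one connection, in the calling thread.
                CopyManager copyManager = conn.unwrap(PGConnection.class).getCopyAPI();
                try (Reader reader = new FileReader("/tmp/data.csv")) {
                    long rows = copyManager.copyIn(
                            "COPY my_table FROM STDIN WITH (FORMAT csv)", reader);
                    System.out.println("Loaded " + rows + " rows");
                }
            }
        }
    }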
On Oct 10, 2018, at 10:50 AM, Laurenz Albe <laurenz.albe@cybertec.at> wrote:
> Rob Sargent wrote:
>> Can anyone here tell me whether or not the CopyManager facility in JDBC
>> via org.postgresql:postgresql:42.1.4 is internally multithreaded?
>> Running on CentOS 7 (all participants), java8, postgres 10.5
>
> It isn't, and there would not be much reason for it to be, as COPY
> in PostgreSQL cannot be parallelized.

Cheers,
rjs
On 2018-10-10 18:50:02 +0200, Laurenz Albe wrote:
> Rob Sargent wrote:
> > Can anyone here tell me whether or not the CopyManager facility in JDBC
> > via org.postgresql:postgresql:42.1.4 is internally multithreaded?
> > Running on CentOS 7 (all participants), java8, postgres 10.5
>
> It isn't, and there would not be much reason for it to be, as COPY
> in PostgreSQL cannot be parallelized.

s/cannot/is not/.

You obviously can just copy the data into postgres over multiple connections
if you need to speed COPY up. But that requires splitting up the data on the
client side.

Greetings,

Andres Freund
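A rough sketch of what Andres describes: split the input on the client side
and stream one slice per connection through CopyManager. The pool size, table
name and file path are illustrative assumptions, and the whole file is read
into memory only to keep the sketch short:

    import java.io.StringReader;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    import org.postgresql.PGConnection;
    import org.postgresql.copy.CopyManager;

    public class ParallelCopy {
        private static final String URL = "jdbc:postgresql://localhost/mydb";
        private static final int THREADS = 4;

        public static void main(String[] args) throws Exception {
            // Client-side split: hand each worker its own slice of the input.
            List<String> lines = Files.readAllLines(Paths.get("/tmp/data.csv"));
            int chunk = (lines.size() + THREADS - 1) / THREADS;

            ExecutorService pool = Executors.newFixedThreadPool(THREADS);
            for (int i = 0; i < THREADS; i++) {
                int from = i * chunk;
                int to = Math.min(lines.size(), from + chunk);
                if (from >= to) {
                    break;
                }
                String slice = String.join("\n", lines.subList(from, to)) + "\n";
                pool.submit(() -> {
                    // One connection and one COPY per worker thread.
                    try (Connection conn = DriverManager.getConnection(URL, "user", "secret")) {
                        CopyManager cm = conn.unwrap(PGConnection.class).getCopyAPI();
                        cm.copyIn("COPY my_table FROM STDIN WITH (FORMAT csv)",
                                  new StringReader(slice));
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
        }
    }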
> On Oct 10, 2018, at 1:24 PM, Andres Freund <andres@anarazel.de> wrote:
>
> On 2018-10-10 18:50:02 +0200, Laurenz Albe wrote:
>> Rob Sargent wrote:
>>> Can anyone here tell me whether or not the CopyManager facility in JDBC
>>> via org.postgresql:postgresql:42.1.4 is internally multithreaded?
>>> Running on CentOS 7 (all participants), java8, postgres 10.5
>>
>> It isn't, and there would not be much reason for it to be, as COPY
>> in PostgreSQL cannot be parallelized.
>
> s/cannot/is not/.
>
> You obviously can just copy the data into postgres over multiple
> connections if you need to speed COPY up. But that requires splitting up
> the data on the client side.
>
> Greetings,
>
> Andres Freund

Interesting note, deeply pocketed. The COPY portion is /not/ my problem -
that's plenty fast enough and memory efficient enough straight out of the
box. I had overlooked that I was requesting 33M records in a map with a
longish CSV string as the key... then playing sort games with the keys
and... well, that's as stupid as I got.
> You obviously can just copy the data into postgres over multiple
> connections if you need to speed COPY up. But that requires splitting up
> the data on the client side.

You obviously are referring to multiple connections running COPY on
different tables, right? Like what pg_restore does with the -j option.

Doesn't COPY take an exclusive lock on the table, which makes it
incompatible with parallelization?
On Wed, 10 Oct 2018 at 16:22, Ravi Krishna <srkrishna1@aol.com> wrote:
> You obviously are referring to multiple connections running COPY on
> different tables, right? Like what pg_restore does with the -j option.
> Doesn't COPY take an exclusive lock on the table, which makes it
> incompatible with parallelization?

No, why would that seem to be the case? If it did so, then you could not
run pg_dump to dump data while regular activity was going on. That's
decidedly not the case.

The challenge in parallelizing a dump via COPY TO is in ensuring that the
multiple requests are attached to the same serializable transaction.
There's a function now that allows multiple connections to attach to the
same transaction context, I believe...

Also, there's the challenge in actually splitting the data, so that both
requests are dumping different data; that might be fairly expensive whether
with or without indices.

--
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"
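The facility Christopher is presumably thinking of is snapshot export
(pg_export_snapshot() plus SET TRANSACTION SNAPSHOT), which is what pg_dump -j
uses to keep its parallel workers consistent. A hedged sketch under that
assumption; connection details and the leader/worker split are made up:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class SharedSnapshot {
        private static final String URL = "jdbc:postgresql://localhost/mydb"; // hypothetical

        public static void main(String[] args) throws Exception {
            // "Leader" connection: open a repeatable-read transaction and export its snapshot.
            Connection leader = DriverManager.getConnection(URL, "user", "secret");
            leader.setAutoCommit(false);
            String snapshotId;
            try (Statement st = leader.createStatement()) {
                st.execute("SET TRANSACTION ISOLATION LEVEL REPEATABLE READ");
                try (ResultSet rs = st.executeQuery("SELECT pg_export_snapshot()")) {
                    rs.next();
                    snapshotId = rs.getString(1);   // e.g. "00000003-0000001B-1"
                }
            }

            // Each worker adopts that snapshot before dumping its share of the data,
            // so all COPY TO requests see exactly the same table contents.
            try (Connection worker = DriverManager.getConnection(URL, "user", "secret")) {
                worker.setAutoCommit(false);
                try (Statement st = worker.createStatement()) {
                    st.execute("SET TRANSACTION ISOLATION LEVEL REPEATABLE READ");
                    st.execute("SET TRANSACTION SNAPSHOT '" + snapshotId + "'");
                    // ... run this worker's slice, e.g. COPY (SELECT ...) TO STDOUT ...
                }
                worker.commit();
            }

            // The leader's transaction must stay open while workers import the snapshot.
            leader.rollback();
            leader.close();
        }
    }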
> No, why would that seem to be the case? If it did so, then you could
> not run pg_dump to dump data while regular activity was going on.

Not sure. In fact I am now confused.

I am talking about pg_restore, which is COPY FROM, which takes an exclusive
lock on the table while the data is loaded into the table. pg_dump does
COPY TO, which reads data out of the table to a file or something else.
Hi,

On 2018-10-10 17:02:59 -0400, Ravi Krishna wrote:
> I am talking about pg_restore, which is COPY FROM, which takes an exclusive
> lock on the table while the data is loaded into the table.

pg_restore doesn't take locks on the table for the COPY; it does so because
creating the table takes an exclusive lock.

Greetings,

Andres Freund
> pg_restore doesn't take locks on the table for the COPY; it does so
> because creating the table takes an exclusive lock.

Interesting. I seem to recollect reading here that I can't have concurrent
COPY on the same table because of the lock. To give an example:

If I have a large file with, say, 400 million rows, can I first split it
into 10 files of 40 million rows each and then fire up 10 different COPY
sessions, each reading from a split file but copying into the same table?
I thought not. It will be great if we can do this.
On October 10, 2018 2:15:19 PM PDT, Ravi Krishna <srkrishna1@aol.com> wrote:
>> pg_restore doesn't take locks on the table for the COPY; it does so
>> because creating the table takes an exclusive lock.
>
> Interesting. I seem to recollect reading here that I can't have concurrent
> COPY on the same table because of the lock. To give an example:
>
> If I have a large file with, say, 400 million rows, can I first split it
> into 10 files of 40 million rows each and then fire up 10 different COPY
> sessions, each reading from a split file but copying into the same table?
> I thought not. It will be great if we can do this.

Yes, you can.

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
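A sketch of that split-and-load approach, assuming the pieces have already
been produced (e.g. with the Unix split utility), sit on the database server,
and the session user is allowed to run server-side COPY (superuser, or
pg_read_server_files membership on PostgreSQL 11+). Paths and table name are
made up:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;
    import java.util.Arrays;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class SplitFileCopy {
        private static final String URL = "jdbc:postgresql://localhost/mydb"; // hypothetical

        public static void main(String[] args) throws Exception {
            List<String> pieces = Arrays.asList(
                    "/data/split/xaa", "/data/split/xab",
                    "/data/split/xac", "/data/split/xad");

            ExecutorService pool = Executors.newFixedThreadPool(pieces.size());
            for (String piece : pieces) {
                pool.submit(() -> {
                    // Each session runs its own server-side COPY into the same table.
                    try (Connection conn = DriverManager.getConnection(URL, "user", "secret");
                         Statement st = conn.createStatement()) {
                        st.execute("COPY test.table1 FROM '" + piece
                                 + "' WITH (FORMAT csv, DELIMITER '|', NULL '')");
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
        }
    }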
Thank you. Let me test it and see the benefit. We have a use case for this.

> On Oct 10, 2018, at 17:18, Andres Freund <andres@anarazel.de> wrote:
>
> On October 10, 2018 2:15:19 PM PDT, Ravi Krishna <srkrishna1@aol.com> wrote:
>>> pg_restore doesn't take locks on the table for the COPY; it does so
>>> because creating the table takes an exclusive lock.
>>
>> Interesting. I seem to recollect reading here that I can't have concurrent
>> COPY on the same table because of the lock. To give an example:
>>
>> If I have a large file with, say, 400 million rows, can I first split it
>> into 10 files of 40 million rows each and then fire up 10 different COPY
>> sessions, each reading from a split file but copying into the same table?
>> I thought not. It will be great if we can do this.
>
> Yes, you can.
>
> Andres
> --
> Sent from my Android device with K-9 Mail. Please excuse my brevity.
On 2018-10-10 17:19:50 -0400, Ravi Krishna wrote:
> > On Oct 10, 2018, at 17:18, Andres Freund <andres@anarazel.de> wrote:
> > On October 10, 2018 2:15:19 PM PDT, Ravi Krishna <srkrishna1@aol.com> wrote:
> >> If I have a large file with, say, 400 million rows, can I first split it
> >> into 10 files of 40 million rows each and then fire up 10 different COPY
> >> sessions, each reading from a split file but copying into the same table?
> >> I thought not. It will be great if we can do this.
> >
> > Yes, you can.
>
> Thank you. Let me test it and see the benefit. We have a use case for this.

You should of course test this on your own hardware with your own data, but
here are the results of a simple benchmark (import 1 million rows into a
table without indexes via different methods) I ran a few weeks ago on one of
our servers:

https://github.com/hjp/dbbench/blob/master/import_pg_comparison/results/claudrin.2018-09-22/results.png

The y axis is rows per second. The x axis shows different runs, sorted from
slowest to fastest (so 2 is the median).

As you can see, it doesn't parallelize perfectly: 2 copy processes are only
about 50 % faster than 1, and 4 are about 33 % faster than 2. But that is
still quite a respectable performance boost.

hp

PS: The script is of course in the same repo, but I didn't include the test
data because I don't think I'm allowed to include that.

--
Peter J. Holzer | hjp@hjp.at | http://www.hjp.at/
"we build much bigger, better disasters now because we have much more
sophisticated management tools." -- Ross Anderson <https://www.edge.org/>
Rob Sargent schrieb am 10.10.2018 um 00:45:
> Can anyone here tell me whether or not the CopyManager facility in
> JDBC via org.postgresql:postgresql:42.1.4 is internally
> multithreaded? Running on CentOS 7 (all participants), java8,
> postgres 10.5

An alternative to writing your own multi-threaded importer might be to use
pgLoader, which supports that out of the box:

https://pgloader.readthedocs.io/en/latest/pgloader.html#a-note-about-parallelism

I have never used it myself, but I have heard good things about it.
>> Thank you. Let me test it and see the benefit. We have a use case for this.

Well, the result is not what I expected.

This is the SQL I used:

copy TEST.TABLE1 from '/var/lib/postgresql/data/rkrishna/copytesting/xaa' with delimiter '|' NULL as '' CSV HEADER;

From another session:

copy TEST.TABLE1 from '/var/lib/postgresql/data/rkrishna/copytesting/xab' with delimiter '|' NULL as '' CSV HEADER;

Each had 16M rows.

I see that one COPY is blocking the other. The table has no indexes while
loading. Aren't they supposed to run concurrently without locking?
> Well, the result is not what I expected.
>
> This is the SQL I used:
>
> copy TEST.TABLE1 from '/var/lib/postgresql/data/rkrishna/copytesting/xaa' with delimiter '|' NULL as '' CSV HEADER;
>
> From another session:
>
> copy TEST.TABLE1 from '/var/lib/postgresql/data/rkrishna/copytesting/xab' with delimiter '|' NULL as '' CSV HEADER;
>
> Each had 16M rows.
>
> I see that one COPY is blocking the other. The table has no indexes while
> loading. Aren't they supposed to run concurrently without locking?

Strike that. The lock wait is bogus. The data did get loaded concurrently.
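For situations like this, one way to check from a third session whether the
concurrent COPY backends are genuinely stuck on a lock or merely reported a
brief wait is to query pg_stat_activity. A small sketch; connection details
are made up, and it assumes PostgreSQL 9.6+ for the wait_event columns and
pg_blocking_pids():

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class CheckCopyWaits {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://localhost/mydb", "user", "secret");
                 Statement st = conn.createStatement();
                 ResultSet rs = st.executeQuery(
                     "SELECT pid, wait_event_type, wait_event, pg_blocking_pids(pid), query"
                   + " FROM pg_stat_activity WHERE query ILIKE 'copy %'")) {
                // Print each COPY backend, what it is waiting on, and who blocks it.
                while (rs.next()) {
                    System.out.printf("pid=%d wait=%s/%s blocked_by=%s%n",
                            rs.getInt(1), rs.getString(2), rs.getString(3), rs.getString(4));
                }
            }
        }
    }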