Re: COPY v. java performance comparison - Mailing list pgsql-general

From Rob Sargent
Subject Re: COPY v. java performance comparison
Date
Msg-id 533D867D.6020603@gmail.com
Whole thread Raw
In response to Re: COPY v. java performance comparison  (Thomas Kellerer <spam_eater@gmx.net>)
Responses Re: COPY v. java performance comparison  (Jeff Janes <jeff.janes@gmail.com>)
List pgsql-general
On 04/03/2014 09:01 AM, Thomas Kellerer wrote:
Rob Sargent, 02.04.2014 21:37:
I loaded 37M+ records using jOOQ (batching every 1000 lines) in 12+
hours (800+ records/sec).  Then I tried COPY and killed that after
11.25 hours when I realised that I had added on non-unque index on
the name fields after the first load. By that point is was on line
28301887, so ~0.75 done which implies it would have take ~15hours to
complete.

Would the overhead of the index likely explain this decrease in
throughput?

Impatience got the better of me and I killed the second COPY.  This
time it had done 54% of the file in 6.75 hours, extrapolating to
roughly 12 hours to do the whole thing.

That matches up with the java speed. Not sure if I should be elated
with jOOQ or disappointed with COPY.

This is not what I see with COPY FROM STDIN

When I load 2million rows using a batch size of 1000 with plain JDBC that takes about 4 minutes

Loading the same file through Java and COPY FROM STDIN takes about 4 seconds

The table looks like this:
               Table "public.products"     Column       |          Type          | Modifiers
-------------------+------------------------+-----------product_id        | integer                | not nullean_code          | bigint                 | not nullproduct_name      | character varying(100) | not nullmanufacturer_name | character varying      | not nullprice             | numeric(10,2)          | not nullpublish_date      | date                   | not null
Indexes:   "products_pkey" PRIMARY KEY, btree (product_id)   "idx_publish_date" btree (publish_date, product_id)


During the load both indexes are present.

Regards
Thomas

Thomas thanks for these numbers. 

I have to straighten out my environment, which I admit I was hoping to avoid. I reset checkpoint_segments to 12 and restarted my server.
I kicked of the COPY at 19:00. That generated a couple of the "too frequent" statements but 52 "WARNING:  pgstat wait timeout" lines during the next 8 hours
starting at 00:37 (5 hours in) 'til finally keeling over at 03:04 on line 37363768.  That's the last line of the input so obviously I didn't flush my last println properly. I'm beyond getting embarrassed at this point.

Is turning auto-vacuum off a reasonable way through this?
 

pgsql-general by date:

Previous
From: Andy Colson
Date:
Subject: Re: window function help
Next
From: Andrew Sullivan
Date:
Subject: Re: Pagination count strategies