Home > mailing lists

Re: COPY v. java performance comparison - Mailing list pgsql-general

From	Jeff Janes
Subject	Re: COPY v. java performance comparison
Date	April 3, 2014 01:36:51
Msg-id	CAMkU=1xRuwr92amUVjFZA02M_iegY9PZApMqTiu+Qi3tNfBTmA@mail.gmail.com Whole thread Raw
In response to	COPY v. java performance comparison (Rob Sargent <robjsargent@gmail.com>)
Responses	Re: COPY v. java performance comparison (Rob Sargent <robjsargent@gmail.com>) Re: COPY v. java performance comparison (Rob Sargent <robjsargent@gmail.com>)
List	pgsql-general

Tree view

On Wed, Apr 2, 2014 at 12:37 PM, Rob Sargent <robjsargent@gmail.com> wrote:

I'm playing with various data models to compare performance and practicalities and not sure if I should be surprised by the numbers I'm getting. I hope this report isn't too wishy-washy for reasoned comment.

One model says a genotype is defined as follows:
         Table "public.oldstyle"
+-------------+--------------+-----------+
|   Column    |     Type     | Modifiers |
+-------------+--------------+-----------+
| id          | uuid         | not null |
| sample_name | text         | not null |
| marker_name | text         | not null |
| allele1     | character(1) |           |
| allele2     | character(1) |           |
+-------------+--------------+-----------+
(0. id is a Primary Key)
(1. Take what you will from the table name.)
(2. I hadn't thought of "char" type at this point)
(3. Ultimately the names would become ids, RI included)
(4. We're loading 39 samples and ~950K markers)
I loaded 37M+ records using jOOQ (batching every 1000 lines) in 12+ hours (800+ records/sec). Then I tried COPY and killed that after 11.25 hours when I realised that I had added on non-unque index on the name fields after the first load. By that point is was on line 28301887, so ~0.75 done which implies it would have take ~15hours to complete.

Would the overhead of the index likely explain this decrease in throughput?

Absolutely.

Impatience got the better of me and I killed the second COPY. This time it had done 54% of the file in 6.75 hours, extrapolating to roughly 12 hours to do the whole thing.

Are you sure you actually dropped the indices? (And the primary key?)

I get about 375,000 lines per second with no indexes, triggers, constraints.

perl -le 'my $x="000000000000"; foreach(1..37e6) {$x++; print join "\t", "a0eebc99-9c0b-4ef8-bb6d-$x",$_,$_,"A","T"}'|time psql -c 'truncate oldstyle; copy oldstyle from stdin;'

(More if I truncate it in the same transaction as the copy)

If you can't drop the pk constraint, can you at least generate the values in sort-order?

Cheers,

Jeff

pgsql-general by date:

From: Shaun Thomas
Date: 03 April 2014, 01:09:51
Subject: Re: SSD Drives

From: Rob Sargent
Date: 03 April 2014, 01:46:14
Subject: Re: COPY v. java performance comparison

Re: COPY v. java performance comparison - Mailing list pgsql-general

Previous

Next