Thread: COPY v. java performance comparison

COPY v. java performance comparison

From
Rob Sargent
Date:
I'm playing with various data models to compare performance and practicalities and not sure if I should be surprised by the numbers I'm getting. I hope this report isn't too wishy-washy for reasoned comment.

One model says a genotype is defined as follows:
         Table "public.oldstyle"
+-------------+--------------+-----------+
|   Column    |     Type     | Modifiers |
+-------------+--------------+-----------+
| id          | uuid         | not null  |
| sample_name | text         | not null  |
| marker_name | text         | not null  |
| allele1     | character(1) |           |
| allele2     | character(1) |           |
+-------------+--------------+-----------+
(0. id is a Primary Key)
(1. Take what you will from the table name.)
(2. I hadn't thought of "char" type at this point)
(3. Ultimately the names would become ids, RI included)
(4. We're loading 39 samples and ~950K markers)
I loaded 37M+ records using jOOQ (batching every 1000 lines) in 12+ hours (800+ records/sec).  Then I tried COPY and killed that after 11.25 hours when I realised that I had added a non-unique index on the name fields after the first load. By that point it was on line 28301887, so ~0.75 done, which implies it would have taken ~15 hours to complete.
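
(For reference, the jOOQ path boils down to a JDBC prepared-statement batch underneath. A minimal sketch of that pattern against the table above; the connection details and file handling here are illustrative, not my actual loader:)

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.util.UUID;

    public class OldstyleLoader {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://localhost/test", "rob", "")) {
                conn.setAutoCommit(false);
                String sql = "insert into oldstyle"
                        + " (id, sample_name, marker_name, allele1, allele2)"
                        + " values (?, ?, ?, ?, ?)";
                try (PreparedStatement ps = conn.prepareStatement(sql);
                     BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
                    String line;
                    int n = 0;
                    while ((line = in.readLine()) != null) {
                        String[] f = line.split(" ");       // sample, marker, allele1, allele2
                        ps.setObject(1, UUID.randomUUID()); // pgjdbc maps java.util.UUID to uuid
                        ps.setString(2, f[0]);
                        ps.setString(3, f[1]);
                        ps.setString(4, f[2]);
                        ps.setString(5, f[3]);
                        ps.addBatch();
                        if (++n % 1000 == 0) ps.executeBatch(); // flush every 1000 lines
                    }
                    ps.executeBatch();                          // flush the tail
                    conn.commit();
                }
            }
        }
    }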

Would the overhead of the index likely explain this decrease in throughput?

Impatience got the better of me and I killed the second COPY.  This time it had done 54% of the file in 6.75 hours, extrapolating to roughly 12 hours to do the whole thing.

That matches up with the java speed. Not sure if I should be elated with jOOQ or disappointed with COPY.

Btw, I can load roughly the same data into the model below in 10.5 seconds. It only adds 39 very wide lines.  I haven't got to the practicality bits yet :)
     Table "public.chipcall"
+-------------+------+-----------+
|   Column    | Type | Modifiers |
+-------------+------+-----------+
| id          | uuid |           |
| sample_name | text |           |
| chip_model  | uuid |           |
| gta         | text |           |
+-------------+------+-----------+


This is just a dev desktop environment:
RHEL 6.5
PostgreSQL 9.3.1 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-4 [hm, interesting contradiction]), 64-bit
2 QuadCore
cpu MHz: 2925.878
The input file and postgres data are on separate disks, but only one controller.

Thanks in advance, even if you only read this far.

Re: COPY v. java performance comparison

From
Steve Atkins
Date:
On Apr 2, 2014, at 12:37 PM, Rob Sargent <robjsargent@gmail.com> wrote:

> I'm playing with various data models to compare performance and practicalities and not sure if I should be surprised by the numbers I'm getting. I hope this report isn't too wishy-washy for reasoned comment.
>
> One model says a genotype is defined as follows:
>          Table "public.oldstyle"
> +-------------+--------------+-----------+
> |   Column    |     Type     | Modifiers |
> +-------------+--------------+-----------+
> | id          | uuid         | not null  |
> | sample_name | text         | not null  |
> | marker_name | text         | not null  |
> | allele1     | character(1) |           |
> | allele2     | character(1) |           |
> +-------------+--------------+-----------+
> (0. id is a Primary Key)
> (1. Take what you will from the table name.)
> (2. I hadn't thought of "char" type at this point)
> (3. Ultimately the names would become ids, RI included)
> (4. We're loading 39 samples and ~950K markers)
> I loaded 37M+ records using jOOQ (batching every 1000 lines) in 12+ hours (800+ records/sec).  Then I tried COPY and killed that after 11.25 hours when I realised that I had added a non-unique index on the name fields after the first load. By that point it was on line 28301887, so ~0.75 done, which implies it would have taken ~15 hours to complete.
>
> Would the overhead of the index likely explain this decrease in throughput?
>
> Impatience got the better of me and I killed the second COPY.  This time it had done 54% of the file in 6.75 hours, extrapolating to roughly 12 hours to do the whole thing.

That seems rather painfully slow. How exactly are you doing the bulk load? Are you CPU limited or disk limited?

Have you read http://www.postgresql.org/docs/current/interactive/populate.html ?

Cheers,
  Steve





Re: COPY v. java performance comparison

From
Rob Sargent
Date:
On 04/02/2014 01:56 PM, Steve Atkins wrote:
On Apr 2, 2014, at 12:37 PM, Rob Sargent <robjsargent@gmail.com> wrote:

I'm playing with various data models to compare performance and practicalities and not sure if I should be surprised by the numbers I'm getting. I hope this report isn't too wishy-washy for reasoned comment.

One model says a genotype is defined as follows:
         Table "public.oldstyle"
+-------------+--------------+-----------+
|   Column    |     Type     | Modifiers |
+-------------+--------------+-----------+
| id          | uuid         | not null  |
| sample_name | text         | not null  |
| marker_name | text         | not null  |
| allele1     | character(1) |           |
| allele2     | character(1) |           |
+-------------+--------------+-----------+
(0. id is a Primary Key)
(1. Take what you will from the table name.)
(2. I hadn't thought of "char" type at this point)
(3. Ultimately the names would become ids, RI included)
(4. We're loading 39 samples and ~950K markers)
I loaded 37M+ records using jOOQ (batching every 1000 lines) in 12+ hours (800+ records/sec).  Then I tried COPY and killed that after 11.25 hours when I realised that I had added a non-unique index on the name fields after the first load. By that point it was on line 28301887, so ~0.75 done, which implies it would have taken ~15 hours to complete.

Would the overhead of the index likely explain this decrease in throughput?

Impatience got the better of me and I killed the second COPY.  This time it had done 54% of the file in 6.75 hours, extrapolating to roughly 12 hours to do the whole thing.
That seems rather painfully slow. How exactly are you doing the bulk load? Are you CPU limited or disk limited?

Have you read http://www.postgresql.org/docs/current/interactive/populate.html ?

Cheers, Steve

The copy command was pretty vanilla:
copy oldstyle from '/export/home/rob/share/testload/<file-redacted>' with delimiter ' ';
I've been to that page, but (as I read them) none sticks out as a sure thing.  I'm not so worried about the actual performance as I am with the relative throughput (sixes so far).

I'm not cpu bound, but I confess I didn't look at io stats during the copy runs. I just assume it was pegged :)

Thanks,

Re: COPY v. java performance comparison

From
Adrian Klaver
Date:
On 04/02/2014 01:14 PM, Rob Sargent wrote:
> On 04/02/2014 01:56 PM, Steve Atkins wrote:
>> On Apr 2, 2014, at 12:37 PM, Rob Sargent<robjsargent@gmail.com>  wrote:
>>
>>> I'm playing with various data models to compare performance and practicalities and not sure if I should be surprised by the numbers I'm getting. I hope this report isn't too wishy-washy for reasoned comment.
>>>
>>> One model says a genotype is defined as follows:
>>>           Table "public.oldstyle"
>>> +-------------+--------------+-----------+
>>> |   Column    |     Type     | Modifiers |
>>> +-------------+--------------+-----------+
>>> | id          | uuid         | not null  |
>>> | sample_name | text         | not null  |
>>> | marker_name | text         | not null  |
>>> | allele1     | character(1) |           |
>>> | allele2     | character(1) |           |
>>> +-------------+--------------+-----------+
>>> (0. id is a Primary Key)
>>> (1. Take what you will from the table name.)
>>> (2. I hadn't thought of "char" type at this point)
>>> (3. Ultimately the names would become ids, RI included)
>>> (4. We're loading 39 samples and ~950K markers)
>>> I loaded 37M+ records using jOOQ (batching every 1000 lines) in 12+ hours (800+ records/sec).  Then I tried COPY and killed that after 11.25 hours when I realised that I had added a non-unique index on the name fields after the first load. By that point it was on line 28301887, so ~0.75 done, which implies it would have taken ~15 hours to complete.
>>>
>>> Would the overhead of the index likely explain this decrease in throughput?
>>>
>>> Impatience got the better of me and I killed the second COPY.  This time it had done 54% of the file in 6.75 hours, extrapolating to roughly 12 hours to do the whole thing.
>> That seems rather painfully slow. How exactly are you doing the bulk load? Are you CPU limited or disk limited?
>>
>> Have you read http://www.postgresql.org/docs/current/interactive/populate.html ?
>>
>> Cheers,
>>    Steve
>>
> The copy command was pretty vanilla:
>
>     copy oldstyle from '/export/home/rob/share/testload/<file-redacted>'
>     with delimiter ' ';
>
> I've been to that page, but (as I read them) none sticks out as a sure
> thing.  I'm not so worried about the actual performance as I am with the
> relative throughput (sixes so far).

Have you looked at the Postgres logs from that time period to see if
there is anything of interest, say complaints about checkpoints?

>
> I'm not cpu bound, but I confess I didn't look at io stats during the
> copy runs. I just assume it was pegged :)
>
> Thanks,


--
Adrian Klaver
adrian.klaver@aklaver.com


Re: COPY v. java performance comparison

From
Bill Moran
Date:
Just for a comparison ... I wrote a Java program that copies data from
MySQL -> Postgres, using the Copy implementation in the JDBC driver.  I've
occasionally seen 50,000+ rows/sec from this program, but the speed is
highly dependent on the table structure.  Tables that are very wide tend
to run slower (for example).
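
In outline it does something like this (a sketch against the pgjdbc CopyManager API, with the thread's oldstyle table as a stand-in target; the connection string, credentials and input handling are placeholders, not my actual program):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import org.postgresql.PGConnection;
    import org.postgresql.copy.CopyManager;

    public class CopyIntoPostgres {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://localhost/test", "user", "")) {
                // CopyManager speaks the COPY wire protocol directly,
                // so there is no per-row INSERT round trip.
                CopyManager cm = ((PGConnection) conn).getCopyAPI();
                try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
                    long rows = cm.copyIn(
                            "COPY oldstyle FROM STDIN WITH DELIMITER ' '", in);
                    System.out.println("loaded " + rows + " rows");
                }
            }
        }
    }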

On Wed, 02 Apr 2014 13:36:27 -0700
Adrian Klaver <adrian.klaver@aklaver.com> wrote:

> On 04/02/2014 01:14 PM, Rob Sargent wrote:
> > On 04/02/2014 01:56 PM, Steve Atkins wrote:
> >> On Apr 2, 2014, at 12:37 PM, Rob Sargent<robjsargent@gmail.com>  wrote:
> >>
> >>> I'm playing with various data models to compare performance and practicalities and not sure if I should be surprised by the numbers I'm getting. I hope this report isn't too wishy-washy for reasoned comment.
> >>>
> >>> One model says a genotype is defined as follows:
> >>>           Table "public.oldstyle"
> >>> +-------------+--------------+-----------+
> >>> |   Column    |     Type     | Modifiers |
> >>> +-------------+--------------+-----------+
> >>> | id          | uuid         | not null  |
> >>> | sample_name | text         | not null  |
> >>> | marker_name | text         | not null  |
> >>> | allele1     | character(1) |           |
> >>> | allele2     | character(1) |           |
> >>> +-------------+--------------+-----------+
> >>> (0. id is a Primary Key)
> >>> (1. Take what you will from the table name.)
> >>> (2. I hadn't thought of "char" type at this point)
> >>> (3. Ultimately the names would become ids, RI included)
> >>> (4. We're loading 39 samples and ~950K markers)
> >>> I loaded 37M+ records using jOOQ (batching every 1000 lines) in 12+ hours (800+ records/sec).  Then I tried COPY and killed that after 11.25 hours when I realised that I had added a non-unique index on the name fields after the first load. By that point it was on line 28301887, so ~0.75 done, which implies it would have taken ~15 hours to complete.
> >>>
> >>> Would the overhead of the index likely explain this decrease in throughput?
> >>>
> >>> Impatience got the better of me and I killed the second COPY.  This time it had done 54% of the file in 6.75 hours, extrapolating to roughly 12 hours to do the whole thing.
> >> That seems rather painfully slow. How exactly are you doing the bulk load? Are you CPU limited or disk limited?
> >>
> >> Have you read http://www.postgresql.org/docs/current/interactive/populate.html ?
> >>
> >> Cheers,
> >>    Steve
> >>
> > The copy command was pretty vanilla:
> >
> >     copy oldstyle from '/export/home/rob/share/testload/<file-redacted>'
> >     with delimiter ' ';
> >
> > I've been to that page, but (as I read them) none sticks out as a sure
> > thing.  I'm not so worried about the actual performance as I am with the
> > relative throughput (sixes so far).
>
> Have you looked at the Postgres logs from that time period to see if
> there is anything of interest, say complaining about checkpoints.
>
> >
> > I'm not cpu bound, but I confess I didn't look at io stats during the
> > copy runs. I just assume it was pegged :)
> >
> > Thanks,
>
>
> --
> Adrian Klaver
> adrian.klaver@aklaver.com
>
>


--
Bill Moran <wmoran@potentialtech.com>


Re: COPY v. java performance comparison

From
Steve Atkins
Date:
On Apr 2, 2014, at 1:14 PM, Rob Sargent <robjsargent@gmail.com> wrote:

> On 04/02/2014 01:56 PM, Steve Atkins wrote:
>> On Apr 2, 2014, at 12:37 PM, Rob Sargent <robjsargent@gmail.com>
>>  wrote:
>>
>>>
>>> Impatience got the better of me and I killed the second COPY.  This time it had done 54% of the file in 6.75 hours, extrapolating to roughly 12 hours to do the whole thing.
>>>
>> That seems rather painfully slow. How exactly are you doing the bulk load? Are you CPU limited or disk limited?
>>
>> Have you read
>> http://www.postgresql.org/docs/current/interactive/populate.html
>>  ?
>>
>> Cheers,
>>   Steve
>>
>>
> The copy command was pretty vanilla:
> copy oldstyle from '/export/home/rob/share/testload/<file-redacted>' with delimiter ' ';
> I've been to that page, but (as I read them) none sticks out as a sure thing.  I'm not so worried about the actual performance as I am with the relative throughput (sixes so far).
>
> I'm not cpu bound, but I confess I didn't look at io stats during the copy runs. I just assume it was pegged :)

If each row is, say, 100 bytes including the per-row overhead (plausible for a uuid and a couple of strings), and you're inserting 800 rows a second, that's 80k/second, which would be fairly pathetic.

On my laptop (which has an SSD, sure, but it's still a laptop) I can insert 40M rows of data that has a few integers and a few small strings in about 52 seconds. And that's just using a simple, single-threaded load using psql to run copy from stdin, reading from the same disk as the DB is on, with no tuning of any parameters to speed up the load.

12 hours suggests there's something fairly badly wrong with what you're doing. I'd definitely look at the server logs, check system load and double check what you're actually running.

(Running the same thing on a tiny VM, one that shares a single RAID5 of 7200rpm drives with about 40 other VMs, takes a shade under two minutes, mostly CPU bound).

Cheers,
  Steve



Re: COPY v. java performance comparison

From
Rob Sargent
Date:
On 04/02/2014 02:36 PM, Adrian Klaver wrote:
On 04/02/2014 01:14 PM, Rob Sargent wrote:
On 04/02/2014 01:56 PM, Steve Atkins wrote:

Impatience got the better of me and I killed the second COPY.  This time it had done 54% of the file in 6.75 hours, extrapolating to roughly 12 hours to do the whole thing.
That seems rather painfully slow. How exactly are you doing the bulk load? Are you CPU limited or disk limited?

Have you read http://www.postgresql.org/docs/current/interactive/populate.html ?

Cheers,
   Steve

The copy command was pretty vanilla:

    copy oldstyle from '/export/home/rob/share/testload/<file-redacted>'
    with delimiter ' ';

I've been to that page, but (as I read them) none sticks out as a sure
thing.  I'm not so worried about the actual performance as I am with the
relative throughput (sixes so far).

Have you looked at the Postgres logs from that time period to see if there is anything of interest, say complaining about checkpoints.


I'm not cpu bound, but I confess I didn't look at io stats during the
copy runs. I just assume it was pegged :)

Thanks,


Well indeed there are copious LOG/HINT pairs along the lines of
LOG:  checkpoints are occurring too frequently ([8 <= n <=29] seconds apart)
HINT:  Consider increasing the configuration parameter "checkpoint_segments".
and these are during non-load periods.  During the COPYs I see uncountable numbers of
WARNING:  pgstat wait timeout
As you guys have probably already concluded, I have some config/tuning to do.

Cheers,
rjs



Re: COPY v. java performance comparison

From
Adrian Klaver
Date:
On 04/02/2014 02:04 PM, Rob Sargent wrote:
> On 04/02/2014 02:36 PM, Adrian Klaver wrote:
>> On 04/02/2014 01:14 PM, Rob Sargent wrote:
>>> On 04/02/2014 01:56 PM, Steve Atkins wrote:

> Well indeed there are copious LOG/HINT pairs along the lines of
>
>     LOG:  checkpoints are occurring too frequently ([8 <= n <=29]
>     seconds apart)
>     HINT:  Consider increasing the configuration parameter
>     "checkpoint_segments".
>
> and these are during non-load periods.  During the COPYs I see
> uncountable numbers of
>
>     WARNING:  pgstat wait timeout
>
> As you guys have probably already concluded, I have some config/tuning
> to do.

At a guess some of the problem comes from here:

"The input file and postgres data are on separate disks, but only one
controller."

Have you tried moving the input file to the same disk as the server, to
factor out the controller?

>
> Cheers,
> rjs
>
>
>


--
Adrian Klaver
adrian.klaver@aklaver.com


Re: COPY v. java performance comparison

From
Rob Sargent
Date:
On 04/02/2014 03:11 PM, Adrian Klaver wrote:
> On 04/02/2014 02:04 PM, Rob Sargent wrote:
>> On 04/02/2014 02:36 PM, Adrian Klaver wrote:
>>> On 04/02/2014 01:14 PM, Rob Sargent wrote:
>>>> On 04/02/2014 01:56 PM, Steve Atkins wrote:
>
>> Well indeed there are copious LOG/HINT pairs along the lines of
>>
>>     LOG:  checkpoints are occurring too frequently ([8 <= n <=29]
>>     seconds apart)
>>     HINT:  Consider increasing the configuration parameter
>>     "checkpoint_segments".
>>
>> and these are during non-load periods.  During the COPYs I see
>> uncountable numbers of
>>
>>     WARNING:  pgstat wait timeout
>>
>> As you guys have probably already concluded, I have some config/tuning
>> to do.
>
> At a guess some of the problem comes from here:
>
> "The input file and postgres data are on separate disks, but only one
> controller."
>
> Have you tried moving the input file to the same disk as the server,
> to factor out the controller?
>
I labour under the delusion that it is through the controller one reads
and writes and that there might be some slight advantage to not doing
both against one drive if avoidable. Wrong again?




Re: COPY v. java performance comparison

From
Jeff Janes
Date:
On Wed, Apr 2, 2014 at 12:37 PM, Rob Sargent <robjsargent@gmail.com> wrote:
I'm playing with various data models to compare performance and practicalities and not sure if I should be surprised by the numbers I'm getting. I hope this report isn't too wishy-washy for reasoned comment.

One model says a genotype is defined as follows:
         Table "public.oldstyle"
+-------------+--------------+-----------+
|   Column    |     Type     | Modifiers |
+-------------+--------------+-----------+
| id          | uuid         | not null  |
| sample_name | text         | not null  |
| marker_name | text         | not null  |
| allele1     | character(1) |           |
| allele2     | character(1) |           |
+-------------+--------------+-----------+
(0. id is a Primary Key)
(1. Take what you will from the table name.)
(2. I hadn't thought of "char" type at this point)
(3. Ultimately the names would become ids, RI included)
(4. We're loading 39 samples and ~950K markers)
I loaded 37M+ records using jOOQ (batching every 1000 lines) in 12+ hours (800+ records/sec).  Then I tried COPY and killed that after 11.25 hours when I realised that I had added a non-unique index on the name fields after the first load. By that point it was on line 28301887, so ~0.75 done, which implies it would have taken ~15 hours to complete.

Would the overhead of the index likely explain this decrease in throughput?

Absolutely.
 

Impatience got the better of me and I killed the second COPY.  This time it had done 54% of the file in 6.75 hours, extrapolating to roughly 12 hours to do the whole thing.

Are you sure you actually dropped the indices?  (And the primary key?)

I get about 375,000 lines per second with no indexes, triggers, constraints.

perl -le 'my $x="000000000000"; foreach(1..37e6) {$x++; print join "\t", "a0eebc99-9c0b-4ef8-bb6d-$x",$_,$_,"A","T"}'|time psql -c 'truncate oldstyle; copy oldstyle from stdin;'

(More if I truncate it in the same transaction as the copy)

If you can't drop the pk constraint, can you at least generate the values in sort-order?
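
For example, something like this (a Java sketch; the fixed high word is arbitrary, and these are not RFC-4122-conformant UUIDs, just monotonically increasing 128-bit values, which is all the btree cares about):

    import java.util.UUID;

    public class SequentialIds {
        private static final long HI = 0xa0eebc999c0b4ef8L; // arbitrary fixed high bits

        private long counter = 0;

        // Each value compares greater than the last (Postgres orders uuids
        // bytewise), so pk inserts always land on the right-most btree page.
        public synchronized UUID next() {
            return new UUID(HI, counter++);
        }
    }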

Cheers,

Jeff

Re: COPY v. java performance comparison

From
Rob Sargent
Date:
On 04/02/2014 04:36 PM, Jeff Janes wrote:
On Wed, Apr 2, 2014 at 12:37 PM, Rob Sargent <robjsargent@gmail.com> wrote:
I'm playing with various data models to compare performance and practicalities and not sure if I should be surprised by the numbers I'm getting. I hope this report isn't too wishy-washy for reasoned comment.

One model says a genotype is defined as follows:
         Table "public.oldstyle"
+-------------+--------------+-----------+
|   Column    |     Type     | Modifiers |
+-------------+--------------+-----------+
| id          | uuid         | not null  |
| sample_name | text         | not null  |
| marker_name | text         | not null  |
| allele1     | character(1) |           |
| allele2     | character(1) |           |
+-------------+--------------+-----------+
(0. id is a Primary Key)
(1. Take what you will from the table name.)
(2. I hadn't thought of "char" type at this point)
(3. Ultimately the names would become ids, RI included)
(4. We're loading 39 samples and ~950K markers)
I loaded 37M+ records using jOOQ (batching every 1000 lines) in 12+ hours (800+ records/sec).  Then I tried COPY and killed that after 11.25 hours when I realised that I had added a non-unique index on the name fields after the first load. By that point it was on line 28301887, so ~0.75 done, which implies it would have taken ~15 hours to complete.

Would the overhead of the index likely explain this decrease in throughput?

Absolutely.
 

Impatience got the better of me and I killed the second COPY.  This time it had done 54% of the file in 6.75 hours, extrapolating to roughly 12 hours to do the whole thing.

Are you sure you actually dropped the indices?  (And the primary key?)

I get about 375,000 lines per second with no indexes, triggers, constraints.

perl -le 'my $x="000000000000"; foreach(1..37e6) {$x++; print join "\t", "a0eebc99-9c0b-4ef8-bb6d-$x",$_,$_,"A","T"}'|time psql -c 'truncate oldstyle; copy oldstyle from stdin;'

(More if I truncate it in the same transaction as the copy)

If you can't drop the pk constraint, can you at least generate the values in sort-order?

Cheers,

Jeff
No, I'll leave the pk in at the very least.  My example load (37M records) will not be the last word by any means.  That's one experiment, if you will.  My goal is not to see how fast I can get records in, rather to see what I can expect going forward. Yes, I'm pretty sure I dropped the index afore the second kick at copy.
I'm about to restart after some config changes (doubled checkpoint_segments; I have no idea what the value should be).

Hope you'll stay tuned.

Re: COPY v. java performance comparison

From
Adrian Klaver
Date:
On 04/02/2014 02:27 PM, Rob Sargent wrote:
> On 04/02/2014 03:11 PM, Adrian Klaver wrote:
>> On 04/02/2014 02:04 PM, Rob Sargent wrote:
>>> On 04/02/2014 02:36 PM, Adrian Klaver wrote:
>>>> On 04/02/2014 01:14 PM, Rob Sargent wrote:
>>>>> On 04/02/2014 01:56 PM, Steve Atkins wrote:

>>
>> Have you tried moving the input file to the same disk as the server,
>> to factor out the controller?
>>
> I labour under the delusion that it is through the controller one reads
> and writes and that there might be some slight advantage to not doing
> both against one drive if avoidable. Wrong again?

Well there is one way to find out :)

Might try with something less than the whole file to come up with an
approximate rows/sec rate.

>
>
>
>


--
Adrian Klaver
adrian.klaver@aklaver.com


Re: COPY v. java performance comparison

From
Rob Sargent
Date:
On 04/02/2014 04:36 PM, Jeff Janes wrote:
On Wed, Apr 2, 2014 at 12:37 PM, Rob Sargent <robjsargent@gmail.com> wrote:
I'm playing with various data models to compare performance and practicalities and not sure if I should be surprised by the numbers I'm getting. I hope this report isn't too wishy-washy for reasoned comment.

One model says a genotype is defined as follows:
         Table "public.oldstyle"
+-------------+--------------+-----------+
|   Column    |     Type     | Modifiers |
+-------------+--------------+-----------+
| id          | uuid         | not null  |
| sample_name | text         | not null  |
| marker_name | text         | not null  |
| allele1     | character(1) |           |
| allele2     | character(1) |           |
+-------------+--------------+-----------+
(0. id is a Primary Key)
(1. Take what you will from the table name.)
(2. I hadn't thought of "char" type at this point)
(3. Ultimately the names would become ids, RI included)
(4. We're loading 39 samples and ~950K markers)
I loaded 37M+ records using jOOQ (batching every 1000 lines) in 12+ hours (800+ records/sec).  Then I tried COPY and killed that after 11.25 hours when I realised that I had added a non-unique index on the name fields after the first load. By that point it was on line 28301887, so ~0.75 done, which implies it would have taken ~15 hours to complete.

Would the overhead of the index likely explain this decrease in throughput?

Absolutely.
 

Impatience got the better of me and I killed the second COPY.  This time it had done 54% of the file in 6.75 hours, extrapolating to roughly 12 hours to do the whole thing.

Are you sure you actually dropped the indices?  (And the primary key?)

I get about 375,000 lines per second with no indexes, triggers, constraints.

perl -le 'my $x="000000000000"; foreach(1..37e6) {$x++; print join "\t", "a0eebc99-9c0b-4ef8-bb6d-$x",$_,$_,"A","T"}'|time psql -c 'truncate oldstyle; copy oldstyle from stdin;'

(More if I truncate it in the same transaction as the copy)

If you can't drop the pk constraint, can you at least generate the values in sort-order?

Cheers,

Jeff
I restarted the java-based loading.  Still with the pk in place (and actually generating the UUIDs etc.), I got ~1.1M rows in one minute, and one "LOG:  checkpoints are occurring too frequently (24 seconds apart)" with checkpoint_segments now at 6 (v. default 3). Plotting the records v. time, in that minute at least it's pretty linear.  The time per batch is relatively constant with a few hiccups (.5 sec per 10K lines).  So I've improved things immensely, but I'm not entirely sure that the simple config change is the reason.  If this continued I would be done inside 40 minutes.

With copy interruptus I now get 2.9M records per minute, so the load would take only 12 or so minutes. I did get four reports of too-frequent checkpoints: two at 15 seconds, two at 9 seconds.

I'll need to let each one go to completion.

If these numbers are at all accurate and realistic, I'm still impressed with jOOQ, though COPY is rightfully back to its proper place as the fastest way to load.





Re: COPY v. java performance comparison

From
Rob Sargent
Date:
On 04/02/2014 06:06 PM, Adrian Klaver wrote:
On 04/02/2014 02:27 PM, Rob Sargent wrote:
On 04/02/2014 03:11 PM, Adrian Klaver wrote:
On 04/02/2014 02:04 PM, Rob Sargent wrote:
On 04/02/2014 02:36 PM, Adrian Klaver wrote:
On 04/02/2014 01:14 PM, Rob Sargent wrote:
On 04/02/2014 01:56 PM, Steve Atkins wrote:


Have you tried moving the input file to the same disk as the server,
to factor out the controller?

I labour under the delusion that it is through the controller one reads
and writes and that there might be some slight advantage to not doing
both against one drive if avoidable. Wrong again?

Well there is one way to find out :)

Might try with something less than the whole file to come up with an approximate rows/sec rate.



Well things slow down over time, and lots of "too frequent"s:
Have done 500 batches in 24219 ms
Have done 1000 batches in 52362 ms
Have done 1500 batches in 82256 ms
Have done 2000 batches in 113754 ms
Have done 2500 batches in 149637 ms
Have done 3000 batches in 211314 ms
Have done 3500 batches in 301989 ms
Have done 4000 batches in 430817 ms
Have done 4500 batches in 596043 ms
Have done 5000 batches in 804250 ms
where a batch is 500,000 lines.  This is on the java side of course.




Re: COPY v. java performance comparison

From
Adrian Klaver
Date:
On 04/02/2014 05:30 PM, Rob Sargent wrote:
> On 04/02/2014 06:06 PM, Adrian Klaver wrote:
>> On 04/02/2014 02:27 PM, Rob Sargent wrote:
>>> On 04/02/2014 03:11 PM, Adrian Klaver wrote:
>>>> On 04/02/2014 02:04 PM, Rob Sargent wrote:
>>>>> On 04/02/2014 02:36 PM, Adrian Klaver wrote:
>>>>>> On 04/02/2014 01:14 PM, Rob Sargent wrote:
>>>>>>> On 04/02/2014 01:56 PM, Steve Atkins wrote:
>>
>>>>
>>>> Have you tried moving the input file to the same disk as the server,
>>>> to factor out the controller?
>>>>
>>> I labour under the delusion that it is through the controller one reads
>>> and writes and that there might be some slight advantage to not doing
>>> both against one drive if avoidable. Wrong again?
>>
>> Well there is one way to find out :)
>>
>> Might try with something less than the whole file to come up with an
>> approximate rows/sec rate.
>>
>>>
>>
> Well things slow down over time, and lots of "too frequent"s:
>
>     Have done 500 batches in 24219 ms
>     Have done 1000 batches in 52362 ms
>     Have done 1500 batches in 82256 ms
>     Have done 2000 batches in 113754 ms
>     Have done 2500 batches in 149637 ms
>     Have done 3000 batches in 211314 ms
>     Have done 3500 batches in 301989 ms
>     Have done 4000 batches in 430817 ms
>     Have done 4500 batches in 596043 ms
>     Have done 5000 batches in 804250 ms
>
> where a batch is 500,000 lines.  This is on the java side of course.

Is this drive to drive, or on a single drive?

>
>
>
>


--
Adrian Klaver
adrian.klaver@aklaver.com


Re: COPY v. java performance comparison

From
Rob Sargent
Date:
On 04/02/2014 08:40 PM, Adrian Klaver wrote:
On 04/02/2014 05:30 PM, Rob Sargent wrote:
On 04/02/2014 06:06 PM, Adrian Klaver wrote:
On 04/02/2014 02:27 PM, Rob Sargent wrote:
On 04/02/2014 03:11 PM, Adrian Klaver wrote:
On 04/02/2014 02:04 PM, Rob Sargent wrote:
On 04/02/2014 02:36 PM, Adrian Klaver wrote:
On 04/02/2014 01:14 PM, Rob Sargent wrote:
On 04/02/2014 01:56 PM, Steve Atkins wrote:


Have you tried moving the input file to the same disk as the server,
to factor out the controller?

I labour under the delusion that it is through the controller one reads
and writes and that there might be some slight advantage to not doing
both against one drive if avoidable. Wrong again?

Well there is one way to find out :)

Might try with something less than the whole file to come up with an
approximate rows/sec rate.



Well things slow down over time, and lots of "too frequent"s:

    Have done 500 batches in 24219 ms
    Have done 1000 batches in 52362 ms
    Have done 1500 batches in 82256 ms
    Have done 2000 batches in 113754 ms
    Have done 2500 batches in 149637 ms
    Have done 3000 batches in 211314 ms
    Have done 3500 batches in 301989 ms
    Have done 4000 batches in 430817 ms
    Have done 4500 batches in 596043 ms
    Have done 5000 batches in 804250 ms

where a batch is 500,000 lines.  This is on the java side of course.

Is this drive to drive, or on a single drive?







Same as first go round, drive to drive.

Re: COPY v. java performance comparison

From
Thomas Kellerer
Date:
Rob Sargent, 02.04.2014 21:37:
> I loaded 37M+ records using jOOQ (batching every 1000 lines) in 12+
> hours (800+ records/sec).  Then I tried COPY and killed that after
> 11.25 hours when I realised that I had added a non-unique index on
> the name fields after the first load. By that point it was on line
> 28301887, so ~0.75 done, which implies it would have taken ~15 hours to
> complete.
>
> Would the overhead of the index likely explain this decrease in
> throughput?
>
> Impatience got the better of me and I killed the second COPY.  This
> time it had done 54% of the file in 6.75 hours, extrapolating to
> roughly 12 hours to do the whole thing.
>
> That matches up with the java speed. Not sure if I should be elated
> with jOOQ or disappointed with COPY.
>

This is not what I see with COPY FROM STDIN.

When I load 2 million rows using a batch size of 1000 with plain JDBC, that takes about 4 minutes.

Loading the same file through Java and COPY FROM STDIN takes about 4 seconds.

The table looks like this:

                Table "public.products"
      Column       |          Type          | Modifiers
-------------------+------------------------+-----------
 product_id        | integer                | not null
 ean_code          | bigint                 | not null
 product_name      | character varying(100) | not null
 manufacturer_name | character varying      | not null
 price             | numeric(10,2)          | not null
 publish_date      | date                   | not null
Indexes:
    "products_pkey" PRIMARY KEY, btree (product_id)
    "idx_publish_date" btree (publish_date, product_id)


During the load both indexes are present.

Regards
Thomas



Re: COPY v. java performance comparison

From
Andy Colson
Date:
On 4/2/2014 7:30 PM, Rob Sargent wrote:
>>
>>>
>>
> Well things slow down over time, and lots of "too frequent"s:
>
>     Have done 500 batches in 24219 ms
>     Have done 1000 batches in 52362 ms
>     Have done 1500 batches in 82256 ms
>     Have done 2000 batches in 113754 ms
>     Have done 2500 batches in 149637 ms
>     Have done 3000 batches in 211314 ms
>     Have done 3500 batches in 301989 ms
>     Have done 4000 batches in 430817 ms
>     Have done 4500 batches in 596043 ms
>     Have done 5000 batches in 804250 ms
>
> where a batch is 500,000 lines.  This is on the java side of course.
>
>
>
>

Have you watched cpu usage of the java process vs the PG process in top?
I had a perl program importing data that was the bottleneck because it
was calling rand().  I'll bet generating uuid's is cpu intensive too.
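
Easy enough to check with a rough timing sketch (nothing scientific; UUID.randomUUID() goes through SecureRandom, which can be surprisingly slow):

    import java.util.UUID;

    public class UuidCost {
        public static void main(String[] args) {
            int n = 1_000_000;
            long acc = 0;                       // keeps the JIT from discarding the calls
            long t0 = System.nanoTime();
            for (int i = 0; i < n; i++) {
                acc ^= UUID.randomUUID().getLeastSignificantBits();
            }
            long ms = (System.nanoTime() - t0) / 1_000_000;
            System.out.println(n + " UUIDs in " + ms + " ms (" + acc + ")");
        }
    }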

Is the java app cpu bound?

Also watch "vmstat 3" for a minute or two.  The last two numbers (wa &
id) (some vmstat's have a steal, ignore that) will tell you if you are
io bound.

-Andy


Re: COPY v. java performance comparison

From
Rob Sargent
Date:
On 04/03/2014 09:01 AM, Thomas Kellerer wrote:
Rob Sargent, 02.04.2014 21:37:
I loaded 37M+ records using jOOQ (batching every 1000 lines) in 12+
hours (800+ records/sec).  Then I tried COPY and killed that after
11.25 hours when I realised that I had added a non-unique index on
the name fields after the first load. By that point it was on line
28301887, so ~0.75 done, which implies it would have taken ~15 hours to
complete.

Would the overhead of the index likely explain this decrease in
throughput?

Impatience got the better of me and I killed the second COPY.  This
time it had done 54% of the file in 6.75 hours, extrapolating to
roughly 12 hours to do the whole thing.

That matches up with the java speed. Not sure if I should be elated
with jOOQ or disappointed with COPY.

This is not what I see with COPY FROM STDIN.

When I load 2 million rows using a batch size of 1000 with plain JDBC, that takes about 4 minutes.

Loading the same file through Java and COPY FROM STDIN takes about 4 seconds.

The table looks like this:
               Table "public.products"     Column       |          Type          | Modifiers
-------------------+------------------------+-----------product_id        | integer                | not nullean_code          | bigint                 | not nullproduct_name      | character varying(100) | not nullmanufacturer_name | character varying      | not nullprice             | numeric(10,2)          | not nullpublish_date      | date                   | not null
Indexes:   "products_pkey" PRIMARY KEY, btree (product_id)   "idx_publish_date" btree (publish_date, product_id)


During the load both indexes are present.

Regards
Thomas

Thomas, thanks for these numbers.

I have to straighten out my environment, which I admit I was hoping to avoid. I reset checkpoint_segments to 12 and restarted my server.
I kicked off the COPY at 19:00. That generated a couple of the "too frequent" statements but 52 "WARNING:  pgstat wait timeout" lines during the next 8 hours
starting at 00:37 (5 hours in) 'til finally keeling over at 03:04 on line 37363768.  That's the last line of the input so obviously I didn't flush my last println properly. I'm beyond getting embarrassed at this point.

Is turning auto-vacuum off a reasonable way through this?
 

Re: COPY v. java performance comparison

From
Rob Sargent
Date:
Is the java app cpu bound?
>
> Also watch "vmstat 3" for a minute or two.  The last two numbers (wa
> & id) (some vmstat's have a steal, ignore that) will tell you if you
> are io bound.
>
> -Andy
>
>

During COPY, with autovacuum off (server restarted, manual vacuum to get things going).  Immediately hit the "too frequently" gripe.  checkpoint_segments still at 12.

Not cpu bound, so waiting for IO, but surely one expects that?

vmstat 3
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  2 100148 3317248 763264 33731032    0    0     0 33355 10555 20122  1  1 84 14  0  
 1  0 100148 3311040 763272 33736972    0    0     0 18128 11118 20601  5  1 87  6  0  
 1  0 100148 3277808 763272 33769008    0    0     0 12536 11373 20551  7  2 90  1  0  
 1  1 100148 3267020 763272 33778732    0    0     0 43125 10968 20848  5  2 90  3  0  
 0  1 100148 3250528 763272 33792804    0    0     0 63567 10467 20121  2  1 81 16  0  
 0  1 100148 3258584 763276 33784308    0    0     0 17121 10553 20146  1  1 86 12  0  
 1  2 100148 3237616 763276 33802024    0    0     0 36327 11198 20151  4  2 85  9  0  
 1  0 100148 3223224 763276 33816556    0    0     0 32189 10763 19900  2  1 84 13  0  
 0  0 100148 3206740 763276 33831580    0    0     0 13233 11080 20469  7  3 89  1  0  
 0  2 100148 3204872 763276 33829580    0    0     0 75205 10500 20912  2  1 84 13  0  
 0  2 100148 3188256 763276 33847544    0    0     0 35448 11028 20788  3  2 86  9  0  
 0  2 100148 3190248 763276 33844888    0    0     0 21808 11938 20848  2  1 82 15  0  
 0  1 100148 3184656 763280 33848968    0    0     0 18592 11562 20574  1  1 84 14  0  


Re: COPY v. java performance comparison

From
Andy Colson
Date:
On 4/3/2014 12:15 PM, Rob Sargent wrote:
> Is the java app cpu bound?
>>
>  > Also watch "vmstat 3" for a minute or two.  The last two numbers (wa
>  > & id) (some vmstat's have a steal, ignore that) will tell you if you
>  > are io bound.
>  >
>  > -Andy
>  >
>  >
> During COPY, with autovacuum off (server restarted, manual vacuum to
> get things going).  Immediately hit the "too frequently" gripe.
> checkpoint_segments still at 12.
>
> Not cpu bound, so waiting for IO, but surely one expects that?
>
> vmstat 3
<snip>



Did you watch top?  Is anything at 100%?

How many cpus are you running?  (well, cpus*cores I guess)

-Andy



Re: COPY v. java performance comparison

From
Andy Colson
Date:
On 4/3/2014 12:57 PM, Andy Colson wrote:
> On 4/3/2014 12:15 PM, Rob Sargent wrote:
>> Is the java app cpu bound?
>>>
>>  > Also watch "vmstat 3" for a minute or two.  The last two numbers (wa
>>  > & id) (some vmstat's have a steal, ignore that) will tell you if you
>>  > are io bound.
>>  >
>>  > -Andy
>>  >
>>  >
>> During COPY, with autovacuum off (server restarted, manual vacuum to
>> get things going).  Immediately hit the "too frequently" gripe.
>> checkpoint_segments still at 12.
>>
>> Not cpu bound, so waiting for IO, but surely one expects that?
>>
>> vmstat 3
> <snip>
>
>
>
> Did you watch top?  Is anything at 100%?
>
> How many cpus are you running?  (well, cpus*cores I guess)
>
> -Andy
>
>
>

From Rob, off list:

Two quad-core
model name      : Intel(R) Xeon(R) CPU           X5570  @ 2.93GHz


So, 8 cores; I'd bet one of them, probably the PG backend doing the
copy, is io bound.

 >> Not cpu bound, so waiting for IO, but surely one expects that?

Not with a good IO subsystem.  We've seen the COPY process be cpu bound
before.

These lines:
>  1  1 100148 3267020 763272 33778732    0    0     0 43125 10968 20848  5  2 90  3  0
>  0  1 100148 3250528 763272 33792804    0    0     0 63567 10467 20121  2  1 81 16  0
>  0  1 100148 3258584 763276 33784308    0    0     0 17121 10553 20146  1  1 86 12  0

Tell an interesting story.  On the first line, one process is running,
the cpu's are 90% idle, and 3% io wait.  On the next two lines (for the
next 6 seconds), there are no running processes, and one cpu is totally
in io wait.  (100% / 8 cores = 12%)

It's also interesting that there are zero bi (blocks in), and lots of bo
(blocks out).  I'm guessing the java app is reading completely from cache.

From your original post of vmstat, the last 5 lines have no processes
running ... I think that means for at least 15 seconds (5 lines * 3
seconds), everything was blocked waiting for IO.

Have you tried the dd benchmark tests?

http://www.westnet.com/~gsmith/content/postgresql/pg-disktesting.htm

-Andy


Re: COPY v. java performance comparison

From
Jeff Janes
Date:
On Wed, Apr 2, 2014 at 3:46 PM, Rob Sargent <robjsargent@gmail.com> wrote:
On 04/02/2014 04:36 PM, Jeff Janes wrote:
 
Are you sure you actually dropped the indices?  (And the primary key?)

I get about 375,000 lines per second with no indexes, triggers, constraints.

perl -le 'my $x="000000000000"; foreach(1..37e6) {$x++; print join "\t", "a0eebc99-9c0b-4ef8-bb6d-$x",$_,$_,"A","T"}'|time psql -c 'truncate oldstyle; copy oldstyle from stdin;'

(More if I truncate it in the same transaction as the copy)

If you can't drop the pk constraint, can you at least generate the values in sort-order?

Cheers,

Jeff
No, I'll leave the pk in at the very least.  My example load (37M records) will not be the last word by any means.  That's one experiment, if you will.  My goal is not to see how fast I can get records in, rather to see what I can expect going forward.

You will probably want to pre-load the unindexed (including no PK) table with dummy values until you anticipate at least one index will be larger than RAM.  Then build the indexes and PK; and then load some more values and time that load.

If you just test on a small table, you will get answers that are unrealistic for the long term.  If you try to build up the table from scratch with the indexes in place, it could take 6 months to simulate 12 months of growth.

Cheers,

Jeff

Re: COPY v. java performance comparison

From
Jeff Janes
Date:
On Wed, Apr 2, 2014 at 5:11 PM, Rob Sargent <robjsargent@gmail.com> wrote:
On 04/02/2014 04:36 PM, Jeff Janes wrote:
On Wed, Apr 2, 2014 at 12:37 PM, Rob Sargent <robjsargent@gmail.com> wrote:
I'm playing with various data models to compare performance and practicalities and not sure if I should be surprised by the numbers I'm getting. I hope this report isn't too wishy-washy for reasoned comment.

One model says a genotype is defined as follows:
         Table "public.oldstyle"
+-------------+--------------+-----------+
|   Column    |     Type     | Modifiers |
+-------------+--------------+-----------+
| id          | uuid         | not null  |
| sample_name | text         | not null  |
| marker_name | text         | not null  |
| allele1     | character(1) |           |
| allele2     | character(1) |           |
+-------------+--------------+-----------+
(0. id is a Primary Key)
(1. Take what you will from the table name.)
(2. I hadn't thought of "char" type at this point)
(3. Ultimately the names would become ids, RI included)
(4. We're loading 39 samples and ~950K markers)
I loaded 37M+ records using jOOQ (batching every 1000 lines) in 12+ hours (800+ records/sec).  Then I tried COPY and killed that after 11.25 hours when I realised that I had added a non-unique index on the name fields after the first load. By that point it was on line 28301887, so ~0.75 done, which implies it would have taken ~15 hours to complete.

Would the overhead of the index likely explain this decrease in throughput?

Absolutely.
 

Impatience got the better of me and I killed the second COPY.  This time it had done 54% of the file in 6.75 hours, extrapolating to roughly 12 hours to do the whole thing.

Are you sure you actually dropped the indices?  (And the primary key?)

I get about 375,000 lines per second with no indexes, triggers, constraints.

perl -le 'my $x="000000000000"; foreach(1..37e6) {$x++; print join "\t", "a0eebc99-9c0b-4ef8-bb6d-$x",$_,$_,"A","T"}'|time psql -c 'truncate oldstyle; copy oldstyle from stdin;'

(More if I truncate it in the same transaction as the copy)

If you can't drop the pk constraint, can you at least generate the values in sort-order?

Cheers,

Jeff
I restarted the java-based loading.  Still with the pk in place (and actually generating the UUIDs etc.), I got ~1.1M rows in one minute, and one "LOG:  checkpoints are occurring too frequently (24 seconds apart)" with checkpoint_segments now at 6 (v. default 3). Plotting the records v. time, in that minute at least it's pretty linear.  The time per batch is relatively constant with a few hiccups (.5 sec per 10K lines).  So I've improved things immensely, but I'm not entirely sure that the simple config change is the reason.  If this continued I would be done inside 40 minutes.

With copy interruptus I now get 2.9M records per minute, so the load would take only 12 or so minutes. I did get four reports of too-frequent checkpoints: two at 15 seconds, two at 9 seconds.

I'll need to let each one go to completion.

If these numbers are at all accurate and realistic, I'm still impressed with jOOQ, though COPY is rightfully back to its proper place as the fastest way to load.

JOOQ is probably hooking into the same API that COPY uses (or if not, it should be), so it isn't surprising that they perform similarly.
 
Cheers,

Jeff

Re: COPY v. java performance comparison

From
Jeff Janes
Date:
On Thu, Apr 3, 2014 at 9:04 AM, Rob Sargent <robjsargent@gmail.com> wrote:

 
I have to straighten out my environment, which I admit I was hoping to avoid. I reset checkpoint_segments to 12 and restarted my server.
I kicked off the COPY at 19:00. That generated a couple of the "too frequent" statements but 52 "WARNING:  pgstat wait timeout" lines during the next 8 hours
starting at 00:37 (5 hours in) 'til finally keeling over at 03:04 on line 37363768. 

Those things are not necessarily problems.  If there is a problem, those tell you places to look, nothing more.  In particular, "pgstat wait timeout" just means "Someone is beating the snot out of your hard drives, and the stat collector just happened to notice that fact".  This is uninformative, because you already know you are beating the snot out of your hard drives.  That, after all, is the point of the exercise, right?  If you saw this message when you weren't doing anything particularly strenuous, then that would be interesting.

 
That's the last line of the input so obviously I didn't flush my last println properly. I'm beyond getting embarrassed at this point.

Is turning auto-vacuum off a reasonable way through this?

No, no, no, no!  First of all, what is the "this" you are trying to get through?  Previously you said you were not trying to get the data in as fast as possible, but only to see what you can expect.  Well, now you see what you can expect.  You can expect to load at a certain speed given a certain table size, and you can expect to see some log messages about unusual activity.  Is it fast enough, or is it not fast enough?  

If it is fast enough, and if you can ignore a few dozen messages in the log file, then you are done.  (Although you will still want to assess how queries against your tables are affected by the load process, assuming your database is used for interactive queries)

If it is not fast enough, then randomly disabling important parts of the system which have nothing to do with the bulk load is probably not the way to improve things, but is an excellent way to shoot yourself in the foot.

Cheers,

Jeff

Re: COPY v. java performance comparison

From
Rob Sargent
Date:
On 04/03/2014 01:28 PM, Jeff Janes wrote:
On Thu, Apr 3, 2014 at 9:04 AM, Rob Sargent <robjsargent@gmail.com> wrote:

 
I have to straighten out my environment, which I admit I was hoping to avoid. I reset checkpoint_segments to 12 and restarted my server.
I kicked off the COPY at 19:00. That generated a couple of the "too frequent" statements but 52 "WARNING:  pgstat wait timeout" lines during the next 8 hours
starting at 00:37 (5 hours in) 'til finally keeling over at 03:04 on line 37363768. 

Those things are not necessarily problems.  If there is a problem, those tell you places to look, nothing more.  In particular, "pgstat wait timeout" just means "Someone is beating the snot out of your hard drives, and the stat collector just happened to notice that fact".  This is uninformative, because you already know you are beating the snot out of your hard drives.  That, after all, is the point of the exercise, right?  If you saw this message when you weren't doing anything particularly strenuous, then that would be interesting.

 
That's the last line of the input so obviously I didn't flush my last println properly. I'm beyond getting embarrassed at this point.

Is turning auto-vacuum off a reasonable way through this?

No, no, no, no!  First of all, what is the "this" you are trying to get through?  Previously you said you were not trying to get the data in as fast as possible, but only to see what you can expect.  Well, now you see what you can expect.  You can expect to load at a certain speed given a certain table size, and you can expect to see some log messages about unusual activity.  Is it fast enough, or is it not fast enough?  

If it is fast enough, and if you can ignore a few dozen messages in the log file, then you are done.  (Although you will still want to assess how queries against your tables are affected by the load process, assuming your database is used for interactive queries)

If it is not fast enough, then randomly disabling important parts of the system which have nothing to do with the bulk load is probably not the way to improve things, but is an excellent way to shoot yourself in the foot.

Cheers,

Jeff
Points well taken.

Others in this thread have suggested that I should in fact expect higher throughput, so I've been angling at that for a bit.