Re: COPY v. java performance comparison - Mailing list pgsql-general

From Adrian Klaver
Subject Re: COPY v. java performance comparison
Date
Msg-id 533C74CB.4070403@aklaver.com
Whole thread Raw
In response to Re: COPY v. java performance comparison  (Rob Sargent <robjsargent@gmail.com>)
Responses Re: COPY v. java performance comparison  (Bill Moran <wmoran@potentialtech.com>)
Re: COPY v. java performance comparison  (Rob Sargent <robjsargent@gmail.com>)
List pgsql-general
On 04/02/2014 01:14 PM, Rob Sargent wrote:
> On 04/02/2014 01:56 PM, Steve Atkins wrote:
>> On Apr 2, 2014, at 12:37 PM, Rob Sargent<robjsargent@gmail.com>  wrote:
>>
>>> I'm playing with various data models to compare performance and practicalities and not sure if I should be
surprisedby the numbers I'm getting. I hope this report isn't too wishy-washy for reasoned comment. 
>>>
>>> One model says a genotype is defined as follows:
>>>           Table "public.oldstyle"
>>> +-------------+--------------+-----------+
>>> |   Column    |     Type     | Modifiers |
>>> +-------------+--------------+-----------+
>>> | id          | uuid         | not null  |
>>> | sample_name | text         | not null  |
>>> | marker_name | text         | not null  |
>>> | allele1     | character(1) |           |
>>> | allele2     | character(1) |           |
>>> +-------------+--------------+-----------+
>>> (0. id is a Primary Key)
>>> (1. Take what you will from the table name.)
>>> (2. I hadn't thought of "char" type at this point)
>>> (3. Ultimately the names would become ids, RI included)
>>> (4. We're loading 39 samples and ~950K markers)
>>> I loaded 37M+ records using jOOQ (batching every 1000 lines) in 12+ hours (800+ records/sec).  Then I tried COPY
andkilled that after 11.25 hours when I realised that I had added on non-unque index on the name fields after the first
load.By that point is was on line 28301887, so ~0.75 done which implies it would have take ~15hours to complete. 
>>>
>>> Would the overhead of the index likely explain this decrease in throughput?
>>>
>>> Impatience got the better of me and I killed the second COPY.  This time it had done 54% of the file in 6.75 hours,
extrapolatingto roughly 12 hours to do the whole thing. 
>> That seems rather painfully slow. How exactly are you doing the bulk load? Are you CPU limited or disk limited?
>>
>> Have you readhttp://www.postgresql.org/docs/current/interactive/populate.html  ?
>>
>> Cheers,
>>    Steve
>>
> The copy command was pretty vanilla:
>
>     copy oldstyle from '/export/home/rob/share/testload/<file-redacted>'
>     with delimiter ' ';
>
> I've been to that page, but (as I read them) none sticks out as a sure
> thing.  I'm not so worried about the actual performance as I am with the
> relative throughput (sixes so far).

Have you looked at the Postgres logs from that time period to see if
there is anything of interest, say complaining about checkpoints.

>
> I'm not cpu bound, but I confess I didn't look at io stats during the
> copy runs. I just assume it was pegged :)
>
> Thanks,


--
Adrian Klaver
adrian.klaver@aklaver.com


pgsql-general by date:

Previous
From: Shaun Thomas
Date:
Subject: Re: Bug in user pg_ident.conf mapping code?
Next
From: Adrian Klaver
Date:
Subject: Re: Bug in user pg_ident.conf mapping code?