Home > mailing lists

Re: COPY v. java performance comparison - Mailing list pgsql-general

From	Bill Moran
Subject	Re: COPY v. java performance comparison
Date	April 2, 2014 23:53:16
Msg-id	20140402165308.103b40a47c78b9f37121e348@potentialtech.com Whole thread Raw
In response to	Re: COPY v. java performance comparison (Adrian Klaver <adrian.klaver@aklaver.com>)
List	pgsql-general

Tree view

Just for a comparison ... I wrote a Java program  that copies data from
MySQL -> Postgres, using the Copy impelmentation in the JDBC driver.  I've
occasionally seen 50,000+ rows/sec from this program, but the speed is
highly dependent on the table structure.  Tables that are very wide tend
to run slower (for example).

On Wed, 02 Apr 2014 13:36:27 -0700
Adrian Klaver <adrian.klaver@aklaver.com> wrote:

> On 04/02/2014 01:14 PM, Rob Sargent wrote:
> > On 04/02/2014 01:56 PM, Steve Atkins wrote:
> >> On Apr 2, 2014, at 12:37 PM, Rob Sargent<robjsargent@gmail.com>  wrote:
> >>
> >>> I'm playing with various data models to compare performance and practicalities and not sure if I should be
surprisedby the numbers I'm getting. I hope this report isn't too wishy-washy for reasoned comment. 
> >>>
> >>> One model says a genotype is defined as follows:
> >>>           Table "public.oldstyle"
> >>> +-------------+--------------+-----------+
> >>> |   Column    |     Type     | Modifiers |
> >>> +-------------+--------------+-----------+
> >>> | id          | uuid         | not null  |
> >>> | sample_name | text         | not null  |
> >>> | marker_name | text         | not null  |
> >>> | allele1     | character(1) |           |
> >>> | allele2     | character(1) |           |
> >>> +-------------+--------------+-----------+
> >>> (0. id is a Primary Key)
> >>> (1. Take what you will from the table name.)
> >>> (2. I hadn't thought of "char" type at this point)
> >>> (3. Ultimately the names would become ids, RI included)
> >>> (4. We're loading 39 samples and ~950K markers)
> >>> I loaded 37M+ records using jOOQ (batching every 1000 lines) in 12+ hours (800+ records/sec).  Then I tried COPY
andkilled that after 11.25 hours when I realised that I had added on non-unque index on the name fields after the first
load.By that point is was on line 28301887, so ~0.75 done which implies it would have take ~15hours to complete. 
> >>>
> >>> Would the overhead of the index likely explain this decrease in throughput?
> >>>
> >>> Impatience got the better of me and I killed the second COPY.  This time it had done 54% of the file in 6.75
hours,extrapolating to roughly 12 hours to do the whole thing. 
> >> That seems rather painfully slow. How exactly are you doing the bulk load? Are you CPU limited or disk limited?
> >>
> >> Have you readhttp://www.postgresql.org/docs/current/interactive/populate.html  ?
> >>
> >> Cheers,
> >>    Steve
> >>
> > The copy command was pretty vanilla:
> >
> >     copy oldstyle from '/export/home/rob/share/testload/<file-redacted>'
> >     with delimiter ' ';
> >
> > I've been to that page, but (as I read them) none sticks out as a sure
> > thing.  I'm not so worried about the actual performance as I am with the
> > relative throughput (sixes so far).
>
> Have you looked at the Postgres logs from that time period to see if
> there is anything of interest, say complaining about checkpoints.
>
> >
> > I'm not cpu bound, but I confess I didn't look at io stats during the
> > copy runs. I just assume it was pegged :)
> >
> > Thanks,
>
>
> --
> Adrian Klaver
> adrian.klaver@aklaver.com
>
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general


--
Bill Moran <wmoran@potentialtech.com>

pgsql-general by date:

From: Shaun Thomas
Date: 02 April 2014, 23:46:10
Subject: Re: Bug in user pg_ident.conf mapping code?

From: Steve Atkins
Date: 02 April 2014, 23:55:20
Subject: Re: COPY v. java performance comparison

Re: COPY v. java performance comparison - Mailing list pgsql-general

Previous

Next