Home > mailing lists

COPY v. java performance comparison - Mailing list pgsql-general

From	Rob Sargent
Subject	COPY v. java performance comparison
Date	April 2, 2014 19:38:03
Msg-id	533C66F4.60409@gmail.com Whole thread Raw
Responses	Re: COPY v. java performance comparison Re: COPY v. java performance comparison Re: COPY v. java performance comparison
List	pgsql-general

Tree view

I'm playing with various data models to compare performance and practicalities and not sure if I should be surprised by the numbers I'm getting. I hope this report isn't too wishy-washy for reasoned comment.

One model says a genotype is defined as follows:

         Table "public.oldstyle"
+-------------+--------------+-----------+
|   Column    |     Type     | Modifiers |
+-------------+--------------+-----------+
| id          | uuid         | not null |
| sample_name | text         | not null |
| marker_name | text         | not null |
| allele1     | character(1) |           |
| allele2     | character(1) |           |
+-------------+--------------+-----------+
(0. id is a Primary Key)
(1. Take what you will from the table name.)
(2. I hadn't thought of "char" type at this point)
(3. Ultimately the names would become ids, RI included)
(4. We're loading 39 samples and ~950K markers)

I loaded 37M+ records using jOOQ (batching every 1000 lines) in 12+ hours (800+ records/sec). Then I tried COPY and killed that after 11.25 hours when I realised that I had added on non-unque index on the name fields after the first load. By that point is was on line 28301887, so ~0.75 done which implies it would have take ~15hours to complete.

Would the overhead of the index likely explain this decrease in throughput?

Impatience got the better of me and I killed the second COPY. This time it had done 54% of the file in 6.75 hours, extrapolating to roughly 12 hours to do the whole thing.

That matches up with the java speed. Not sure if I should be elated with jOOQ or disappointed with COPY.

Btw, I can load the roughly the same data in to the model below in 10.5 seconds. It only adds 39 very wide lines. I haven't got to the practicality bits yet :)

     Table "public.chipcall"
+-------------+------+-----------+
|   Column    | Type | Modifiers |
+-------------+------+-----------+
| id          | uuid |           |
| sample_name | text |           |
| chip_model | uuid |           |
| gta         | text |           |
+-------------+------+-----------+

This just a dev desktop environment:
RHEL 6.5
PostgreSQL 9.3.1 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-4 [hm, interesting contradiction]), 64-bit
2 QuadCore cpu MHz: 2925.878
The input file and postgres data are on separate disks, but only one controller.

Thanks in advance, even if you only read this far.

pgsql-general by date:

From: Bret Stern
Date: 02 April 2014, 19:36:20
Subject: SSD Drives

From: David Johnston
Date: 02 April 2014, 19:49:51
Subject: Re: pg_stat_activity

COPY v. java performance comparison - Mailing list pgsql-general

Previous

Next