Thread: Using a COPY...FROM through JDBC?
I've got a simple database (no indices, 6 columns) that I need to write data into quickly, through multiple simultaneous JDBC connections, in a distributed environment. (This is going to be a message logging service for software-generated messages.)

Using a PreparedStatement, I can get about 400/s inserted. If I (on the java side) buffer up the entries and dump them in large transaction blocks I can push this up to about 1200/s. I'd like to go faster.

One approach that I think might be promising would be to try using a COPY command instead of an INSERT, but I don't have a file for input, I have a Java collection, so COPY isn't quite right. Is there any way to use COPY efficiently without having to create a file (remember that the java apps are distributed on a LAN and aren't running on the DB server)? Is this a dead end because of the way COPY is implemented to only read from a file?

Is there something else I can do? Ultimately, this will end up on a machine running 1+0 RAID, so I expect that will give me some performance boost as well, but I'd like to push it up as best I can with my current hardware setup.

Thanks for any advice!
-Steve
--
Steve Wampler -- swampler@noao.edu
The gods that smiled on your birth are now laughing out loud.
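A minimal sketch of the buffered, single-transaction insert approach described in the post. The table and column names are invented for illustration (the real schema just has six columns and no indices):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.util.List;

    public class BatchedLogWriter {

        // Hypothetical table and column names; the real table has 6 columns and no indices.
        private static final String INSERT_SQL =
            "INSERT INTO log_messages (host, source, level, category, logged, msg) "
            + "VALUES (?, ?, ?, ?, ?, ?)";

        // Writes one buffered block of rows inside a single transaction.
        public static void writeBlock(Connection conn, List<Object[]> rows) throws Exception {
            conn.setAutoCommit(false);                    // group all inserts into one commit
            try (PreparedStatement ps = conn.prepareStatement(INSERT_SQL)) {
                for (Object[] row : rows) {
                    for (int i = 0; i < row.length; i++) {
                        ps.setObject(i + 1, row[i]);      // JDBC parameters are 1-based
                    }
                    ps.addBatch();
                }
                ps.executeBatch();                        // send the whole block at once
                conn.commit();                            // one WAL sync for the whole block
            } catch (Exception e) {
                conn.rollback();
                throw e;
            }
        }
    }

A single commit per block means the backend only has to sync the WAL once per block rather than once per row, which is presumably where most of the 400/s to 1200/s improvement comes from.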
Hi, Steve,

On Sat, 05 Jun 2004 13:12:29 -0700
Steve Wampler <swampler@noao.edu> wrote:

> I've got a simple database (no indices, 6 columns) that I need
> to write data into quickly, through multiple simultaneous JDBC
> connections, in a distributed environment. (This is going to be
> a message logging service for software-generated messages.)
>
> Using a PreparedStatement, I can get about 400/s inserted. If I
> (on the java side) buffer up the entries and dump them in large
> transaction blocks I can push this up to about 1200/s. I'd
> like to go faster. One approach that I think might be
> promising would be to try using a COPY command instead of
> an INSERT, but I don't have a file for input, I have a
> Java collection, so COPY isn't quite right. Is there any way to
> use COPY efficiently without having to create a file (remember
> that the java apps are distributed on a LAN and aren't running
> on the DB server)? Is this a dead end because of the way
> COPY is implemented to only read from a file?

We also found that using the psql frontend, COPY seems to give a factor 10 or more speedup. Sadly, as far as I have learned, the current JDBC driver does not support COPY ... FROM STDIN.

As a very bad workaround, it might be acceptable to use Runtime.exec() to start the psql command line tool and issue the statement there, or even to call a C library via JNI. Of course, the best "workaround" would be to implement COPY support in the driver and send the patches to the PGJDBC team for inclusion :-)

We also had to do some trickery to get insertion of lots of rows fast. We did lots of benchmarks, and currently use the following method:

Our input data is divided into chunks (the optimal size depends on the machine, and seems to be between 250 and 3000 rows). As the current pgjdbc PreparedStatement implementation just does a text replacement, but we wanted to get the last bit of speed out of the machine, we issue a PREPARE statement for the insertion on connection setup, and then addBatch() an "EXECUTE blubb (data, row, values)" statement for each row.

Then we have several concurrent threads, all running essentially a {get batch, write batch, commit} loop on their own connection. Increasing the thread count beyond three did not show further substantial performance improvements. This led us to the conclusion that concurrency can compensate for the time the postmaster is forced to wait while it syncs the WAL to disk, but there is still a concurrency limit inside postgres for inserts (I presume they have to lock at some point; the multiversioning seems not to cover inserts very well).

Also, we surprisingly found that setting the transaction isolation level to "serializable" can speed things up remarkably in some cases...

> Is there something else I can do? Ultimately, this will end
> up on a machine running 1+0 RAID, so I expect that will give
> me some performance boost as well, but I'd like to push it
> up as best I can with my current hardware setup.

As any sane setup runs with syncing enabled in the backend, and each sync (and so each commit) has to write at least one block, you can calculate the theoretical maximum number of commits your machine can achieve. If you have 15k rpm disks (AFAIK, the fastest ones currently available), they spin at 250 rotations per second, so you cannot have more than 250 commits per second per disk. Given that your machine has to do some work between the sync() calls (e.g.
processing the whole next batch), it is very likely to miss the next disk rotation, so in reality you're likely to see a number that is a factor of 2 or 3 lower.

One way to overcome this limit is to use multiple writer threads and (given a highly capable I/O subsystem) to enable commit_delay in your backend, so that more than one commit can share the same write operation. It might also help to put the WAL log on a different disk (just link, mount, or mount --bind the appropriate subdirectory in your database directory), or even to put the indices on a third disk (needs ugly trickery) - it's a shame that postmaster does not really support these techniques, which are considered standard in any professional database.

If you really need much more speed, you could try putting the WAL on a solid state disk (essentially battery-backed RAM) so you can overcome this physical limit, or (if you really trust your hardware and your power supply) put the WAL on a RAM disk or switch off syncing in your postmaster configuration.

One thing you should check is whether I/O or CPU is the limiting factor. If you have CPU utilization higher than 90%, then all the tricks I told you about won't help much. (But using COPY could still help a lot, as it cuts down the CPU usage very much.)

We tested with two machines, a single-processor developer machine and a 2-way 64-bit Itanium SMP machine. On the desktop machine, a single thread already utilized 80% CPU, so only a small improvement was possible using 2 or more threads. On the SMP machine, we had substantial improvements using 2 or 3 threads, but going up to 8 threads gave no further remarkable speedup, constantly utilizing about 120% CPU (remember we have a 2-way machine). I think that there are some internal postgres locks that prohibit further concurrency for inserts into the same table.

> Thanks for any advice!

Hope that helps,
Markus Schaber

--
markus schaber | dipl. informatiker
logi-track ag | rennweg 14-16 | ch 8001 zürich
phone +41-43-888 62 52 | fax +41-43-888 62 53
mailto:schabios@logi-track.com | www.logi-track.com
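A minimal sketch of the chunked PREPARE/EXECUTE writer loop described above, as it might look on the Java side. The statement name "blubb" is taken from the post; the table, column types, queue plumbing, and the assumption that values arrive as pre-escaped SQL literals are all made up for illustration:

    import java.sql.Connection;
    import java.sql.Statement;
    import java.util.List;
    import java.util.concurrent.BlockingQueue;

    // One writer thread per connection, each running a {get batch, write batch, commit} loop.
    public class PreparedExecuteWriter implements Runnable {

        private final Connection conn;
        private final BlockingQueue<List<String[]>> batches;  // chunks of roughly 250-3000 rows

        public PreparedExecuteWriter(Connection conn, BlockingQueue<List<String[]>> batches) {
            this.conn = conn;
            this.batches = batches;
        }

        // Issued once on connection setup so the backend plans the insert only once.
        private void prepareInsert() throws Exception {
            try (Statement st = conn.createStatement()) {
                st.execute("PREPARE blubb (bigint, double precision, double precision, timestamp) AS "
                         + "INSERT INTO measurements (id, val1, val2, logged) VALUES ($1, $2, $3, $4)");
            }
        }

        @Override
        public void run() {
            try {
                conn.setAutoCommit(false);
                prepareInsert();
                while (!Thread.currentThread().isInterrupted()) {
                    List<String[]> batch = batches.take();
                    try (Statement st = conn.createStatement()) {
                        for (String[] row : batch) {
                            // Rows are assumed to already be valid, escaped SQL literals here.
                            st.addBatch("EXECUTE blubb (" + String.join(", ", row) + ")");
                        }
                        st.executeBatch();
                    }
                    conn.commit();   // one WAL sync per chunk, not per row
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }
    }

With two or three such threads on separate connections, one backend's WAL sync can overlap with another thread's batch processing, which matches the observation above that going beyond three threads stopped helping.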
Hi, Steve,

Here are the results of some benchmarks we did inserting 30k rows into a table, using "time psql -f blubb.sql -q dbname":

File              kingfisher            skate
30kinsert.sql     39.359s    762r/s     335.024s    90r/s
30kcommonce.sql   11.402s   2631r/s       7.086s  4233r/s
30kwithids.sql    10.138s   2959r/s       6.936s  4325r/s
30kprepare.sql     8.173s   3670r/s       5.189s  5781r/s
30kdump.sql        1.286s  23328r/s       0.785s 38216r/s
30kdumpseq.sql     1.498s  20026r/s       0.927s 32362r/s

Kingfisher is the single-processor machine I mentioned yesterday night; skate is the SMP machine.

The table has five columns (bigint, bigint, double, double, timestamp without time zone). The first of them has a "default nextval('sequence'::text)" definition, and there are no further constraints or indices.

30kinsert.sql uses simple insert statements with autocommit on, and we insert all but the first column, which is filled by the default sequence. With this test, kingfisher seems to have an unrealistically high number of commits (syncs) per second (see what I wrote yesterday) [1]; skate has a more realistic value.

30kcommonce.sql, as suspected, gives a rather high boost by encapsulating everything in a single commit.

30kwithids gives a small boost by inserting pre-calculated sequence numbers, so it seems not worth the effort to move this logic into the application.

30kprepare prepares the insert statement and then issues 30k EXECUTE statements within one transaction; the speedup is noticeable.

30kdump simply inserts the 30k rows as a dump via COPY FROM STDIN. (As with 30kwithids, the first column is contained in the insert data, so the default value sequence is not used.) Naturally, this is by far the fastest method.

30kdumpseq.sql uses COPY, too, but omits the first column and thus uses the sequence generation again. This gives a noticeable 15% slowdown, but still seems fast enough for our purposes. Sadly, COPY is not available within JDBC.

Thanks for your patience.

Footnotes:
[1] We suspect this to be some strange interaction between IDE, cryptoloop and ext3fs, so that the sync() call somehow does not really wait for the data to be physically written to disk. (I really can't imagine a crypto-looped notebook harddisk doing more syncs/second than a SCSI-based RAID in a server machine. We did some small benchmarks on the sync()/fsync() calls that seem to support this conclusion.)

--
markus schaber | dipl. informatiker
logi-track ag | rennweg 14-16 | ch 8001 zürich
phone +41-43-888 62 52 | fax +41-43-888 62 53
mailto:schabios@logi-track.com | www.logi-track.com
On Sat, 5 Jun 2004, Steve Wampler wrote:

> [I want to use copy from JDBC]

I made a patch to the driver to support COPY as a PG extension. The patch required properly encoded and formatted copy data available from an InputStream. Following some feedback from others, I began adding the ability to handle different encodings and the ability to read and write objects without requiring any knowledge of the copy data format. I got hung up on the object read/write part because of some issues with how type conversions are done in the driver.

At the moment there is a big push being made by Oliver Jowett to get true V3 protocol support implemented, which is currently my highest priority. Getting copy support into the JDBC driver is something I'd like to see for 7.5, but I couldn't say if that will happen or how complete it may be. Depending on your needs, perhaps the initial patch is sufficient.

http://archives.postgresql.org/pgsql-jdbc/2003-12/msg00186.php

Kris Jurka
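For comparison, the COPY support that later shipped in the stock pgjdbc driver is exposed through org.postgresql.copy.CopyManager, and using it looks roughly like the sketch below. The table name and the assumption that rows are already formatted as tab-separated COPY text are illustrative only; the 2003 patch linked above has a different interface:

    import java.io.StringReader;
    import java.sql.Connection;
    import java.util.List;
    import org.postgresql.PGConnection;
    import org.postgresql.copy.CopyManager;

    public class CopyLogWriter {

        // Streams a block of pre-formatted rows through COPY ... FROM STDIN over the
        // existing JDBC connection; no file is written on either side. Rows are assumed
        // to already be tab-separated text in COPY's default format, so the caller must
        // escape embedded tabs, backslashes, and newlines.
        public static long copyBlock(Connection conn, List<String> rows) throws Exception {
            StringBuilder buf = new StringBuilder();
            for (String row : rows) {
                buf.append(row).append('\n');
            }
            CopyManager copyApi = conn.unwrap(PGConnection.class).getCopyAPI();
            return copyApi.copyIn("COPY log_messages FROM STDIN",
                                  new StringReader(buf.toString()));
        }
    }

copyIn returns the number of rows loaded, and the data is streamed as COPY input over the same connection, which is exactly what the original question asked for.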
On Mon, 2004-06-07 at 02:26, Kris Jurka wrote:
> On Sat, 5 Jun 2004, Steve Wampler wrote:
> > [I want to use copy from JDBC]
>
> I made a patch to the driver to support COPY as a PG extension. ...
> http://archives.postgresql.org/pgsql-jdbc/2003-12/msg00186.php

Thanks Kris - that patch worked beautifully and bumped the insert rate from ~1000 entries/second to ~9000 e/s in my test code. Here's hoping it makes it into 7.5.

I do have a little concern about what's happening in the back end during the copy - I suspect the entire table is locked, which may impact the performance when multiple clients are saving entries into the table. Anyone know if that's how COPY works? (For that matter, would that also be true of a transaction consisting of a set of inserts?)

Thanks again!
--
Steve Wampler -- swampler@noao.edu
The gods that smiled on your birth are now laughing out loud.
On Mon, 2004-06-07 at 10:40, Steve Wampler wrote:
> Thanks Kris - that patch worked beautifully and bumped the
> insert rate from ~1000 entries/second to ~9000 e/s in my
> test code.

As a followup - that 9000 e/s becomes ~21,000 e/s if I don't have the java code also dump the message to standard output!
--
Steve Wampler -- swampler@noao.edu
The gods that smiled on your birth are now laughing out loud.
On Mon, 7 Jun 2004, Steve Wampler wrote:

> I do have a little concern about what's happening in the
> back end during the copy - I suspect the entire table is
> locked, which may impact the performance when multiple
> clients are saving entries into the table. Anyone know
> if that's how COPY works? (For that matter, would that
> also be true of a transaction consisting of a set of
> inserts?)

The table is not locked in either the copy or the insert case.

Kris Jurka