Thread: [JDBC] Efficient use of ON CONFLICT DO UPDATE with the PostgreSQL JDBC driver

[JDBC] Efficient use of ON CONFLICT DO UPDATE with the PostgreSQL JDBC driver

From

Christopher BROWN

Date:

13 January 2017, 17:01:53

Hello,

I've at last had an opportunity to use PostgreSQL 9.6 (instead of 9.4) for a project, and have been able to use the ON CONFLICT DO UPDATE clause. As it can involve repeating parameter values (including, as is my case, large binary files), I'm wondering if there's any performance or resource-usage side effects (or workarounds) when doing something like this:

INSERT INTO foo (c_uuid, file_data, file_name) VALUES (?, ?, ?)

ON CONFLICT (c_uuid) DO UPDATE SET file_data = ?, file_name = ?

On the Java side, that means invoking PreparedStatement::setBinaryStream TWICE for the same data.

I'm not noticing side effects on localhost (but haven't profiled either). Is this fine, or a potential problem? Is there a better approach?

Thanks,
Christopher

Re: [JDBC] Efficient use of ON CONFLICT DO UPDATE with the PostgreSQLJDBC driver

From

Dave Cramer

Date:

13 January 2017, 17:20:09

On 13 January 2017 at 06:01, Christopher BROWN <brown@reflexe.fr> wrote:

Hello,

I've at last had an opportunity to use PostgreSQL 9.6 (instead of 9.4) for a project, and have been able to use the ON CONFLICT DO UPDATE clause. As it can involve repeating parameter values (including, as is my case, large binary files), I'm wondering if there's any performance or resource-usage side effects (or workarounds) when doing something like this:

INSERT INTO foo (c_uuid, file_data, file_name) VALUES (?, ?, ?)
ON CONFLICT (c_uuid) DO UPDATE SET file_data = ?, file_name = ?

On the Java side, that means invoking PreparedStatement::setBinaryStream TWICE for the same data.

I'm not noticing side effects on localhost (but haven't profiled either). Is this fine, or a potential problem? Is there a better approach?

So to summarize you would like some sort of optimization where we figured out that two of the values were the same in the initial insert and the on conflict and only send it once ?

Dave Cramer

davec@postgresintl.com

www.postgresintl.com

Re: Efficient use of ON CONFLICT DO UPDATE with the PostgreSQLJDBC driver

From

Dave Cramer

Date:

13 January 2017, 17:20:09

On 13 January 2017 at 06:01, Christopher BROWN <brown@reflexe.fr> wrote:

Hello,

I've at last had an opportunity to use PostgreSQL 9.6 (instead of 9.4) for a project, and have been able to use the ON CONFLICT DO UPDATE clause. As it can involve repeating parameter values (including, as is my case, large binary files), I'm wondering if there's any performance or resource-usage side effects (or workarounds) when doing something like this:

INSERT INTO foo (c_uuid, file_data, file_name) VALUES (?, ?, ?)
ON CONFLICT (c_uuid) DO UPDATE SET file_data = ?, file_name = ?

On the Java side, that means invoking PreparedStatement::setBinaryStream TWICE for the same data.

I'm not noticing side effects on localhost (but haven't profiled either). Is this fine, or a potential problem? Is there a better approach?

So to summarize you would like some sort of optimization where we figured out that two of the values were the same in the initial insert and the on conflict and only send it once ?

Dave Cramer

davec@postgresintl.com

www.postgresintl.com

Re: Efficient use of ON CONFLICT DO UPDATE with the PostgreSQLJDBC driver

From

Vladimir Sitnikov

Date:

13 January 2017, 17:42:18

>On the Java side, that means invoking PreparedStatement::setBinaryStream TWICE for the same data.

AFAIK, frontend-backend protocol allows to reuse $1 multiple times in the same query, so it seems to be indeed a bit efficient to use

values($1, $2, $3) on conflict.... set file_data=$1, file_name=$2

kind of query.

Obviously, you can use pl/pgsql procedure to workaround that. At least that would optimize network traffic.

AFAIK, JDBC references parameters by parameter position, so there's no implementation-independent way to reuse the same parameter without it sending it twice over the wire.

Vladimir

Re: [JDBC] Efficient use of ON CONFLICT DO UPDATE with the PostgreSQLJDBC driver

From

Vladimir Sitnikov

Date:

13 January 2017, 17:42:18

>On the Java side, that means invoking PreparedStatement::setBinaryStream TWICE for the same data.

AFAIK, frontend-backend protocol allows to reuse $1 multiple times in the same query, so it seems to be indeed a bit efficient to use

values($1, $2, $3) on conflict.... set file_data=$1, file_name=$2

kind of query.

Obviously, you can use pl/pgsql procedure to workaround that. At least that would optimize network traffic.

AFAIK, JDBC references parameters by parameter position, so there's no implementation-independent way to reuse the same parameter without it sending it twice over the wire.

Vladimir

Re: [JDBC] Efficient use of ON CONFLICT DO UPDATE with the PostgreSQLJDBC driver

From

Christopher BROWN

Date:

13 January 2017, 17:52:44

On 13 January 2017 at 15:20, Dave Cramer <pg@fastcrypt.com> wrote:

On 13 January 2017 at 06:01, Christopher BROWN <brown@reflexe.fr> wrote:
Hello,

I've at last had an opportunity to use PostgreSQL 9.6 (instead of 9.4) for a project, and have been able to use the ON CONFLICT DO UPDATE clause. As it can involve repeating parameter values (including, as is my case, large binary files), I'm wondering if there's any performance or resource-usage side effects (or workarounds) when doing something like this:

INSERT INTO foo (c_uuid, file_data, file_name) VALUES (?, ?, ?)
ON CONFLICT (c_uuid) DO UPDATE SET file_data = ?, file_name = ?

On the Java side, that means invoking PreparedStatement::setBinaryStream TWICE for the same data.

I'm not noticing side effects on localhost (but haven't profiled either). Is this fine, or a potential problem? Is there a better approach?

So to summarize you would like some sort of optimization where we figured out that two of the values were the same in the initial insert and the on conflict and only send it once ?

Dave Cramer

davec@postgresintl.com
www.postgresintl.com

To summarize, I'd like to avoid sending the same data twice, if that has any potentially significant performance issues and/or resource-usage issues. I don't know if that's the case or not.

The SQL syntax (and the query workflow, and parameter order) is specific to PostgreSQL, so I'm not against something that varies from the basic "PreparedStatement" syntax. If there's a safe way of figuring out the intention, that'd be excellent. Maybe if the driver has "seen" the parameter set (via PreparedStatement::setXXX) in the INSERT part for field "X", and it "sees" the field "X" is unset in the ON CONFLICT part, or if you could have some sort of back-reference (either by index, e.g. $1, $2 or whatever, or by name if that's possible).

Re: Efficient use of ON CONFLICT DO UPDATE with the PostgreSQLJDBC driver

From

Christopher BROWN

Date:

13 January 2017, 17:52:44

On 13 January 2017 at 15:20, Dave Cramer <pg@fastcrypt.com> wrote:

On 13 January 2017 at 06:01, Christopher BROWN <brown@reflexe.fr> wrote:
Hello,

I've at last had an opportunity to use PostgreSQL 9.6 (instead of 9.4) for a project, and have been able to use the ON CONFLICT DO UPDATE clause. As it can involve repeating parameter values (including, as is my case, large binary files), I'm wondering if there's any performance or resource-usage side effects (or workarounds) when doing something like this:

INSERT INTO foo (c_uuid, file_data, file_name) VALUES (?, ?, ?)
ON CONFLICT (c_uuid) DO UPDATE SET file_data = ?, file_name = ?

On the Java side, that means invoking PreparedStatement::setBinaryStream TWICE for the same data.

I'm not noticing side effects on localhost (but haven't profiled either). Is this fine, or a potential problem? Is there a better approach?

So to summarize you would like some sort of optimization where we figured out that two of the values were the same in the initial insert and the on conflict and only send it once ?

Dave Cramer

davec@postgresintl.com
www.postgresintl.com

To summarize, I'd like to avoid sending the same data twice, if that has any potentially significant performance issues and/or resource-usage issues. I don't know if that's the case or not.

Re: Efficient use of ON CONFLICT DO UPDATE with the PostgreSQL JDBCdriver

From

Thomas Kellerer

Date:

13 January 2017, 18:16:43

Christopher BROWN schrieb am 13.01.2017 um 15:01:
> I've at last had an opportunity to use PostgreSQL 9.6 (instead of
> 9.4) for a project, and have been able to use the ON CONFLICT DO
> UPDATE clause. As it can involve repeating parameter values
> (including, as is my case, large binary files), I'm wondering if
> there's any performance or resource-usage side effects (or
> workarounds) when doing something like this:
>
> INSERT INTO foo (c_uuid, file_data, file_name) VALUES (?, ?, ?)
> ON CONFLICT (c_uuid) DO UPDATE SET file_data = ?, file_name = ?
>
> On the Java side, that means invoking PreparedStatement::setBinaryStream TWICE for the same data.
>
> Is there a better approach?

Maybe I'm missing something but you don't need to the parameters in the UPDATE part:


INSERT INTO foo (c_uuid, file_data, file_name) VALUES (?, ?, ?)
ON CONFLICT (c_uuid) DO
  UPDATE SET file_data = excluded.file_data,
             file_name = excluded.file_name;


This assumes that in case of an update you actually want to use the same values as provided in the VALUES () clause

[JDBC] Re: Efficient use of ON CONFLICT DO UPDATE with the PostgreSQL JDBCdriver

From

Thomas Kellerer

Date:

13 January 2017, 18:16:43

Christopher BROWN schrieb am 13.01.2017 um 15:01:
> I've at last had an opportunity to use PostgreSQL 9.6 (instead of
> 9.4) for a project, and have been able to use the ON CONFLICT DO
> UPDATE clause. As it can involve repeating parameter values
> (including, as is my case, large binary files), I'm wondering if
> there's any performance or resource-usage side effects (or
> workarounds) when doing something like this:
>
> INSERT INTO foo (c_uuid, file_data, file_name) VALUES (?, ?, ?)
> ON CONFLICT (c_uuid) DO UPDATE SET file_data = ?, file_name = ?
>
> On the Java side, that means invoking PreparedStatement::setBinaryStream TWICE for the same data.
>
> Is there a better approach?

Maybe I'm missing something but you don't need to the parameters in the UPDATE part:


INSERT INTO foo (c_uuid, file_data, file_name) VALUES (?, ?, ?)
ON CONFLICT (c_uuid) DO
  UPDATE SET file_data = excluded.file_data,
             file_name = excluded.file_name;


This assumes that in case of an update you actually want to use the same values as provided in the VALUES () clause

Re: Efficient use of ON CONFLICT DO UPDATE with the PostgreSQLJDBC driver

From

Jorge Solórzano

Date:

13 January 2017, 18:40:01

Sadly for better or worse (I think for worse), in the JDBC specification, parameter markers are represented by positional "?" in the SQL string, so you will need to send the data twice over the wire.

BTW this use case don't really seems to be of high impact on performance, so adding an extension to the specification don't make sense IMO.

Jorge Solórzano

me.jorsol.com

On Fri, Jan 13, 2017 at 8:52 AM, Christopher BROWN <brown@reflexe.fr> wrote:

On 13 January 2017 at 15:20, Dave Cramer <pg@fastcrypt.com> wrote:

On 13 January 2017 at 06:01, Christopher BROWN <brown@reflexe.fr> wrote:
Hello,

I've at last had an opportunity to use PostgreSQL 9.6 (instead of 9.4) for a project, and have been able to use the ON CONFLICT DO UPDATE clause. As it can involve repeating parameter values (including, as is my case, large binary files), I'm wondering if there's any performance or resource-usage side effects (or workarounds) when doing something like this:

INSERT INTO foo (c_uuid, file_data, file_name) VALUES (?, ?, ?)
ON CONFLICT (c_uuid) DO UPDATE SET file_data = ?, file_name = ?

On the Java side, that means invoking PreparedStatement::setBinaryStream TWICE for the same data.

I'm not noticing side effects on localhost (but haven't profiled either). Is this fine, or a potential problem? Is there a better approach?

So to summarize you would like some sort of optimization where we figured out that two of the values were the same in the initial insert and the on conflict and only send it once ?

Dave Cramer

davec@postgresintl.com
www.postgresintl.com

To summarize, I'd like to avoid sending the same data twice, if that has any potentially significant performance issues and/or resource-usage issues. I don't know if that's the case or not.

The SQL syntax (and the query workflow, and parameter order) is specific to PostgreSQL, so I'm not against something that varies from the basic "PreparedStatement" syntax. If there's a safe way of figuring out the intention, that'd be excellent. Maybe if the driver has "seen" the parameter set (via PreparedStatement::setXXX) in the INSERT part for field "X", and it "sees" the field "X" is unset in the ON CONFLICT part, or if you could have some sort of back-reference (either by index, e.g. $1, $2 or whatever, or by name if that's possible).

Re: [JDBC] Efficient use of ON CONFLICT DO UPDATE with the PostgreSQLJDBC driver

From

Jorge Solórzano

Date:

13 January 2017, 18:40:01

Jorge Solórzano

me.jorsol.com

On Fri, Jan 13, 2017 at 8:52 AM, Christopher BROWN <brown@reflexe.fr> wrote:

On 13 January 2017 at 15:20, Dave Cramer <pg@fastcrypt.com> wrote:

On 13 January 2017 at 06:01, Christopher BROWN <brown@reflexe.fr> wrote:
Hello,

I've at last had an opportunity to use PostgreSQL 9.6 (instead of 9.4) for a project, and have been able to use the ON CONFLICT DO UPDATE clause. As it can involve repeating parameter values (including, as is my case, large binary files), I'm wondering if there's any performance or resource-usage side effects (or workarounds) when doing something like this:

INSERT INTO foo (c_uuid, file_data, file_name) VALUES (?, ?, ?)
ON CONFLICT (c_uuid) DO UPDATE SET file_data = ?, file_name = ?

On the Java side, that means invoking PreparedStatement::setBinaryStream TWICE for the same data.

I'm not noticing side effects on localhost (but haven't profiled either). Is this fine, or a potential problem? Is there a better approach?

So to summarize you would like some sort of optimization where we figured out that two of the values were the same in the initial insert and the on conflict and only send it once ?

Dave Cramer

davec@postgresintl.com
www.postgresintl.com

To summarize, I'd like to avoid sending the same data twice, if that has any potentially significant performance issues and/or resource-usage issues. I don't know if that's the case or not.

The SQL syntax (and the query workflow, and parameter order) is specific to PostgreSQL, so I'm not against something that varies from the basic "PreparedStatement" syntax. If there's a safe way of figuring out the intention, that'd be excellent. Maybe if the driver has "seen" the parameter set (via PreparedStatement::setXXX) in the INSERT part for field "X", and it "sees" the field "X" is unset in the ON CONFLICT part, or if you could have some sort of back-reference (either by index, e.g. $1, $2 or whatever, or by name if that's possible).

Re: Efficient use of ON CONFLICT DO UPDATE with the PostgreSQLJDBC driver

From

Brad DeJong

Date:

13 January 2017, 20:18:01

Christopher Brown wrote:

> ... I'd like to avoid sending the same data twice ...

> ... when doing something like this:

> INSERT INTO foo (c_uuid, file_data, file_name) VALUES (?, ?, ?)

> ON CONFLICT (c_uuid) DO UPDATE SET file_data = ?, file_name = ?

> ... The SQL syntax (and the query workflow, and parameter order) is specific to PostgreSQL,

> so I'm not against something that varies from the basic "PreparedStatement" syntax.

> If there's a safe way of figuring out the intention, that'd be excellent. ...

I find it easiest to make the intention explicit like this. This is also a more conventient syntax for handling multiple rows because you just extend the "values (?, ?, ?)" to values "(?, ?, ?), (?, ?, ?), (?, ?, ?)".

with params (c_uuid, file_data, file_name) as (values (?, ?, ?))

insert into foo (c_uuid, file_data, file_name)

select c_uuid, file_data, file_name from params

on conflict (c_uuid) do update set (file_data, file_name) = (select file_data, file_name from params)

I ran this statement in 9.6 through pgAdmin 4 in order to verify the syntax (with the ?, ?, ? replaced with literals - I have not yet figured out how to bind parameters to parameterized statements in pgAdmin 4) and the messages were ...

INSERT 0 1

Query returned successfully in 182 msec.

Re: [JDBC] Efficient use of ON CONFLICT DO UPDATE with the PostgreSQLJDBC driver

From

Brad DeJong

Date:

13 January 2017, 20:18:01

Christopher Brown wrote:

> ... I'd like to avoid sending the same data twice ...

> ... when doing something like this:

> INSERT INTO foo (c_uuid, file_data, file_name) VALUES (?, ?, ?)

> ON CONFLICT (c_uuid) DO UPDATE SET file_data = ?, file_name = ?

> ... The SQL syntax (and the query workflow, and parameter order) is specific to PostgreSQL,

> so I'm not against something that varies from the basic "PreparedStatement" syntax.

> If there's a safe way of figuring out the intention, that'd be excellent. ...

with params (c_uuid, file_data, file_name) as (values (?, ?, ?))

insert into foo (c_uuid, file_data, file_name)

select c_uuid, file_data, file_name from params

on conflict (c_uuid) do update set (file_data, file_name) = (select file_data, file_name from params)

INSERT 0 1

Query returned successfully in 182 msec.

Re: [JDBC] Efficient use of ON CONFLICT DO UPDATE with the PostgreSQLJDBC driver

From

Brad DeJong

Date:

13 January 2017, 20:32:07

Christopher Brown wrote:

> ...

> I'm wondering if there's any performance or resource-usage side effects (or workarounds) when doing something like this:

> INSERT INTO foo (c_uuid, file_data, file_name) VALUES (?, ?, ?)

> ON CONFLICT (c_uuid) DO UPDATE SET file_data = ?, file_name = ?

> On the Java side, that means invoking PreparedStatement::setBinaryStream TWICE for the same data. ...

Yes, there are lots of resource usage side effects.

pgjdbc transmits the byte stream over the network once for every time the stream is bound.

You will also need to provide independent InputStream values (or use mark()/reset()) because setBinaryStream() consumes the stream.

The nice thing with pgjdbc is that it consumes the stream at the time you make the setBinaryStream() call. "small" streams (50 KB or less)

are materialized in memory while larger streams are written to a temp file. The mark()/reset() trick doesn't work for JDBC drivers that

consume the stream lazily.

If your query uses simple protocol, the query is materialized as a StringBuilder/String in its entirety (including parameter replacement)

before the query is transmitted (when you execute the statement). Which, if you have large blob/clob values, can result in

OutOfMemoryError or running into the internal JVM limits on String/array size (normally Integer.MAX_VALUE).

Re: Efficient use of ON CONFLICT DO UPDATE with the PostgreSQLJDBC driver

From

Brad DeJong

Date:

13 January 2017, 20:32:07

Christopher Brown wrote:

> ...

> I'm wondering if there's any performance or resource-usage side effects (or workarounds) when doing something like this:

> INSERT INTO foo (c_uuid, file_data, file_name) VALUES (?, ?, ?)

> ON CONFLICT (c_uuid) DO UPDATE SET file_data = ?, file_name = ?

> On the Java side, that means invoking PreparedStatement::setBinaryStream TWICE for the same data. ...

Yes, there are lots of resource usage side effects.

pgjdbc transmits the byte stream over the network once for every time the stream is bound.

You will also need to provide independent InputStream values (or use mark()/reset()) because setBinaryStream() consumes the stream.

The nice thing with pgjdbc is that it consumes the stream at the time you make the setBinaryStream() call. "small" streams (50 KB or less)

are materialized in memory while larger streams are written to a temp file. The mark()/reset() trick doesn't work for JDBC drivers that

consume the stream lazily.

If your query uses simple protocol, the query is materialized as a StringBuilder/String in its entirety (including parameter replacement)

before the query is transmitted (when you execute the statement). Which, if you have large blob/clob values, can result in

OutOfMemoryError or running into the internal JVM limits on String/array size (normally Integer.MAX_VALUE).

Re: Efficient use of ON CONFLICT DO UPDATE with the PostgreSQL JDBCdriver

From

Thomas Kellerer

Date:

13 January 2017, 22:17:13

Brad DeJong schrieb am 13.01.2017 um 18:18:
> I find it easiest to make the intention explicit like this. This is
> also a more conventient syntax for handling multiple rows because you
> just extend the "values (?, ?, ?)" to values "(?, ?, ?), (?, ?, ?),
> (?, ?, ?)".
>
>     with params (c_uuid, file_data, file_name) as (values (?, ?, ?))
>     insert into foo (c_uuid, file_data, file_name)
>     select c_uuid, file_data, file_name from params
>     on conflict (c_uuid) do update set (file_data, file_name) = (select file_data, file_name from params)
>
>
> I ran this statement in 9.6 through pgAdmin 4 in order to verify the
> syntax (with the ?, ?, ? replaced with literals - I have not yet
> figured out how to bind parameters to parameterized statements in
> pgAdmin 4) and the messages were ...

That is overly complicated because you can do a simple:

INSERT INTO foo (c_uuid, file_data, file_name) VALUES (?, ?, ?)
ON CONFLICT (c_uuid) DO
   UPDATE SET file_data = excluded.file_data,
              file_name = excluded.file_name;

Plus: using EXCLUDED means this will also work for multi-row inserts which your workaround does not:


INSERT INTO foo (c_uuid, file_data, file_name) VALUES (?, ?, ?), (?, ?, ?), (?, ?, ?)
ON CONFLICT (c_uuid) DO
   UPDATE SET file_data = excluded.file_data,
              file_name = excluded.file_name;

[JDBC] Re: Efficient use of ON CONFLICT DO UPDATE with the PostgreSQL JDBCdriver

From

Thomas Kellerer

Date:

13 January 2017, 22:17:13

Brad DeJong schrieb am 13.01.2017 um 18:18:
> I find it easiest to make the intention explicit like this. This is
> also a more conventient syntax for handling multiple rows because you
> just extend the "values (?, ?, ?)" to values "(?, ?, ?), (?, ?, ?),
> (?, ?, ?)".
>
>     with params (c_uuid, file_data, file_name) as (values (?, ?, ?))
>     insert into foo (c_uuid, file_data, file_name)
>     select c_uuid, file_data, file_name from params
>     on conflict (c_uuid) do update set (file_data, file_name) = (select file_data, file_name from params)
>
>
> I ran this statement in 9.6 through pgAdmin 4 in order to verify the
> syntax (with the ?, ?, ? replaced with literals - I have not yet
> figured out how to bind parameters to parameterized statements in
> pgAdmin 4) and the messages were ...

That is overly complicated because you can do a simple:

INSERT INTO foo (c_uuid, file_data, file_name) VALUES (?, ?, ?)
ON CONFLICT (c_uuid) DO
   UPDATE SET file_data = excluded.file_data,
              file_name = excluded.file_name;

Plus: using EXCLUDED means this will also work for multi-row inserts which your workaround does not:


INSERT INTO foo (c_uuid, file_data, file_name) VALUES (?, ?, ?), (?, ?, ?), (?, ?, ?)
ON CONFLICT (c_uuid) DO
   UPDATE SET file_data = excluded.file_data,
              file_name = excluded.file_name;