Thread: [JDBC] Efficient use of ON CONFLICT DO UPDATE with the PostgreSQL JDBC driver

[JDBC] Efficient use of ON CONFLICT DO UPDATE with the PostgreSQL JDBC driver

From
Christopher BROWN
Date:
Hello,

I've at last had an opportunity to use PostgreSQL 9.6 (instead of 9.4) for a project, and have been able to use the ON CONFLICT DO UPDATE clause. As it can involve repeating parameter values (including, as is my case, large binary files), I'm wondering if there's any performance or resource-usage side effects (or workarounds) when doing something like this:

INSERT INTO foo (c_uuid, file_data, file_name) VALUES (?, ?, ?)
ON CONFLICT (c_uuid) DO UPDATE SET file_data = ?, file_name = ?

On the Java side, that means invoking PreparedStatement::setBinaryStream TWICE for the same data.

I'm not noticing side effects on localhost (but haven't profiled either).  Is this fine, or a potential problem?   Is there a better approach?

Thanks,
Christopher

On 13 January 2017 at 06:01, Christopher BROWN <brown@reflexe.fr> wrote:
Hello,

I've at last had an opportunity to use PostgreSQL 9.6 (instead of 9.4) for a project, and have been able to use the ON CONFLICT DO UPDATE clause. As it can involve repeating parameter values (including, as is my case, large binary files), I'm wondering if there's any performance or resource-usage side effects (or workarounds) when doing something like this:

INSERT INTO foo (c_uuid, file_data, file_name) VALUES (?, ?, ?)
ON CONFLICT (c_uuid) DO UPDATE SET file_data = ?, file_name = ?

On the Java side, that means invoking PreparedStatement::setBinaryStream TWICE for the same data.

I'm not noticing side effects on localhost (but haven't profiled either).  Is this fine, or a potential problem?   Is there a better approach?

So to summarize you would like some sort of optimization where we figured out that two of the values were the same in the initial insert and the on conflict and only send it once ?


Re: Efficient use of ON CONFLICT DO UPDATE with the PostgreSQLJDBC driver

From
Dave Cramer
Date:

On 13 January 2017 at 06:01, Christopher BROWN <brown@reflexe.fr> wrote:
Hello,

I've at last had an opportunity to use PostgreSQL 9.6 (instead of 9.4) for a project, and have been able to use the ON CONFLICT DO UPDATE clause. As it can involve repeating parameter values (including, as is my case, large binary files), I'm wondering if there's any performance or resource-usage side effects (or workarounds) when doing something like this:

INSERT INTO foo (c_uuid, file_data, file_name) VALUES (?, ?, ?)
ON CONFLICT (c_uuid) DO UPDATE SET file_data = ?, file_name = ?

On the Java side, that means invoking PreparedStatement::setBinaryStream TWICE for the same data.

I'm not noticing side effects on localhost (but haven't profiled either).  Is this fine, or a potential problem?   Is there a better approach?

So to summarize you would like some sort of optimization where we figured out that two of the values were the same in the initial insert and the on conflict and only send it once ?


Re: Efficient use of ON CONFLICT DO UPDATE with the PostgreSQLJDBC driver

From
Vladimir Sitnikov
Date:
>On the Java side, that means invoking PreparedStatement::setBinaryStream TWICE for the same data.


AFAIK, frontend-backend protocol allows to reuse $1 multiple times in the same query, so it seems to be indeed a bit efficient to use

values($1, $2, $3) on conflict.... set file_data=$1, file_name=$2

kind of query.

Obviously, you can use pl/pgsql procedure to workaround that. At least that would optimize network traffic.


AFAIK, JDBC references parameters by parameter position, so there's no implementation-independent way to reuse the same parameter without it sending it twice over the wire.


Vladimir

Re: [JDBC] Efficient use of ON CONFLICT DO UPDATE with the PostgreSQLJDBC driver

From
Vladimir Sitnikov
Date:
>On the Java side, that means invoking PreparedStatement::setBinaryStream TWICE for the same data.


AFAIK, frontend-backend protocol allows to reuse $1 multiple times in the same query, so it seems to be indeed a bit efficient to use

values($1, $2, $3) on conflict.... set file_data=$1, file_name=$2

kind of query.

Obviously, you can use pl/pgsql procedure to workaround that. At least that would optimize network traffic.


AFAIK, JDBC references parameters by parameter position, so there's no implementation-independent way to reuse the same parameter without it sending it twice over the wire.


Vladimir

Re: [JDBC] Efficient use of ON CONFLICT DO UPDATE with the PostgreSQLJDBC driver

From
Christopher BROWN
Date:
On 13 January 2017 at 15:20, Dave Cramer <pg@fastcrypt.com> wrote:

On 13 January 2017 at 06:01, Christopher BROWN <brown@reflexe.fr> wrote:
Hello,

I've at last had an opportunity to use PostgreSQL 9.6 (instead of 9.4) for a project, and have been able to use the ON CONFLICT DO UPDATE clause. As it can involve repeating parameter values (including, as is my case, large binary files), I'm wondering if there's any performance or resource-usage side effects (or workarounds) when doing something like this:

INSERT INTO foo (c_uuid, file_data, file_name) VALUES (?, ?, ?)
ON CONFLICT (c_uuid) DO UPDATE SET file_data = ?, file_name = ?

On the Java side, that means invoking PreparedStatement::setBinaryStream TWICE for the same data.

I'm not noticing side effects on localhost (but haven't profiled either).  Is this fine, or a potential problem?   Is there a better approach?

So to summarize you would like some sort of optimization where we figured out that two of the values were the same in the initial insert and the on conflict and only send it once ?



To summarize, I'd like to avoid sending the same data twice, if that has any potentially significant performance issues and/or resource-usage issues.  I don't know if that's the case or not.

The SQL syntax (and the query workflow, and parameter order) is specific to PostgreSQL, so I'm not against something that varies from the basic "PreparedStatement" syntax.  If there's a safe way of figuring out the intention, that'd be excellent.  Maybe if the driver has "seen" the parameter set (via PreparedStatement::setXXX) in the INSERT part for field "X", and it "sees" the field "X" is unset in the ON CONFLICT part, or if you could have some sort of back-reference (either by index, e.g. $1, $2 or whatever, or by name if that's possible).

Re: Efficient use of ON CONFLICT DO UPDATE with the PostgreSQLJDBC driver

From
Christopher BROWN
Date:
On 13 January 2017 at 15:20, Dave Cramer <pg@fastcrypt.com> wrote:

On 13 January 2017 at 06:01, Christopher BROWN <brown@reflexe.fr> wrote:
Hello,

I've at last had an opportunity to use PostgreSQL 9.6 (instead of 9.4) for a project, and have been able to use the ON CONFLICT DO UPDATE clause. As it can involve repeating parameter values (including, as is my case, large binary files), I'm wondering if there's any performance or resource-usage side effects (or workarounds) when doing something like this:

INSERT INTO foo (c_uuid, file_data, file_name) VALUES (?, ?, ?)
ON CONFLICT (c_uuid) DO UPDATE SET file_data = ?, file_name = ?

On the Java side, that means invoking PreparedStatement::setBinaryStream TWICE for the same data.

I'm not noticing side effects on localhost (but haven't profiled either).  Is this fine, or a potential problem?   Is there a better approach?

So to summarize you would like some sort of optimization where we figured out that two of the values were the same in the initial insert and the on conflict and only send it once ?



To summarize, I'd like to avoid sending the same data twice, if that has any potentially significant performance issues and/or resource-usage issues.  I don't know if that's the case or not.

The SQL syntax (and the query workflow, and parameter order) is specific to PostgreSQL, so I'm not against something that varies from the basic "PreparedStatement" syntax.  If there's a safe way of figuring out the intention, that'd be excellent.  Maybe if the driver has "seen" the parameter set (via PreparedStatement::setXXX) in the INSERT part for field "X", and it "sees" the field "X" is unset in the ON CONFLICT part, or if you could have some sort of back-reference (either by index, e.g. $1, $2 or whatever, or by name if that's possible).

Re: Efficient use of ON CONFLICT DO UPDATE with the PostgreSQL JDBCdriver

From
Thomas Kellerer
Date:
Christopher BROWN schrieb am 13.01.2017 um 15:01:
> I've at last had an opportunity to use PostgreSQL 9.6 (instead of
> 9.4) for a project, and have been able to use the ON CONFLICT DO
> UPDATE clause. As it can involve repeating parameter values
> (including, as is my case, large binary files), I'm wondering if
> there's any performance or resource-usage side effects (or
> workarounds) when doing something like this:
>
> INSERT INTO foo (c_uuid, file_data, file_name) VALUES (?, ?, ?)
> ON CONFLICT (c_uuid) DO UPDATE SET file_data = ?, file_name = ?
>
> On the Java side, that means invoking PreparedStatement::setBinaryStream TWICE for the same data.
>
> Is there a better approach?

Maybe I'm missing something but you don't need to the parameters in the UPDATE part:


INSERT INTO foo (c_uuid, file_data, file_name) VALUES (?, ?, ?)
ON CONFLICT (c_uuid) DO
  UPDATE SET file_data = excluded.file_data,
             file_name = excluded.file_name;


This assumes that in case of an update you actually want to use the same values as provided in the VALUES () clause


[JDBC] Re: Efficient use of ON CONFLICT DO UPDATE with the PostgreSQL JDBCdriver

From
Thomas Kellerer
Date:
Christopher BROWN schrieb am 13.01.2017 um 15:01:
> I've at last had an opportunity to use PostgreSQL 9.6 (instead of
> 9.4) for a project, and have been able to use the ON CONFLICT DO
> UPDATE clause. As it can involve repeating parameter values
> (including, as is my case, large binary files), I'm wondering if
> there's any performance or resource-usage side effects (or
> workarounds) when doing something like this:
>
> INSERT INTO foo (c_uuid, file_data, file_name) VALUES (?, ?, ?)
> ON CONFLICT (c_uuid) DO UPDATE SET file_data = ?, file_name = ?
>
> On the Java side, that means invoking PreparedStatement::setBinaryStream TWICE for the same data.
>
> Is there a better approach?

Maybe I'm missing something but you don't need to the parameters in the UPDATE part:


INSERT INTO foo (c_uuid, file_data, file_name) VALUES (?, ?, ?)
ON CONFLICT (c_uuid) DO
  UPDATE SET file_data = excluded.file_data,
             file_name = excluded.file_name;


This assumes that in case of an update you actually want to use the same values as provided in the VALUES () clause


Re: Efficient use of ON CONFLICT DO UPDATE with the PostgreSQLJDBC driver

From
Jorge Solórzano
Date:
Sadly for better or worse (I think for worse), in the JDBC specification, parameter markers are represented by positional "?" in the SQL string, so you will need to send the data twice over the wire.

BTW this use case don't really seems to be of high impact on performance, so adding an extension to the specification don't make sense IMO.

Jorge Solórzano
me.jorsol.com

On Fri, Jan 13, 2017 at 8:52 AM, Christopher BROWN <brown@reflexe.fr> wrote:
On 13 January 2017 at 15:20, Dave Cramer <pg@fastcrypt.com> wrote:

On 13 January 2017 at 06:01, Christopher BROWN <brown@reflexe.fr> wrote:
Hello,

I've at last had an opportunity to use PostgreSQL 9.6 (instead of 9.4) for a project, and have been able to use the ON CONFLICT DO UPDATE clause. As it can involve repeating parameter values (including, as is my case, large binary files), I'm wondering if there's any performance or resource-usage side effects (or workarounds) when doing something like this:

INSERT INTO foo (c_uuid, file_data, file_name) VALUES (?, ?, ?)
ON CONFLICT (c_uuid) DO UPDATE SET file_data = ?, file_name = ?

On the Java side, that means invoking PreparedStatement::setBinaryStream TWICE for the same data.

I'm not noticing side effects on localhost (but haven't profiled either).  Is this fine, or a potential problem?   Is there a better approach?

So to summarize you would like some sort of optimization where we figured out that two of the values were the same in the initial insert and the on conflict and only send it once ?



To summarize, I'd like to avoid sending the same data twice, if that has any potentially significant performance issues and/or resource-usage issues.  I don't know if that's the case or not.

The SQL syntax (and the query workflow, and parameter order) is specific to PostgreSQL, so I'm not against something that varies from the basic "PreparedStatement" syntax.  If there's a safe way of figuring out the intention, that'd be excellent.  Maybe if the driver has "seen" the parameter set (via PreparedStatement::setXXX) in the INSERT part for field "X", and it "sees" the field "X" is unset in the ON CONFLICT part, or if you could have some sort of back-reference (either by index, e.g. $1, $2 or whatever, or by name if that's possible).


Re: [JDBC] Efficient use of ON CONFLICT DO UPDATE with the PostgreSQLJDBC driver

From
Jorge Solórzano
Date:
Sadly for better or worse (I think for worse), in the JDBC specification, parameter markers are represented by positional "?" in the SQL string, so you will need to send the data twice over the wire.

BTW this use case don't really seems to be of high impact on performance, so adding an extension to the specification don't make sense IMO.

Jorge Solórzano
me.jorsol.com

On Fri, Jan 13, 2017 at 8:52 AM, Christopher BROWN <brown@reflexe.fr> wrote:
On 13 January 2017 at 15:20, Dave Cramer <pg@fastcrypt.com> wrote:

On 13 January 2017 at 06:01, Christopher BROWN <brown@reflexe.fr> wrote:
Hello,

I've at last had an opportunity to use PostgreSQL 9.6 (instead of 9.4) for a project, and have been able to use the ON CONFLICT DO UPDATE clause. As it can involve repeating parameter values (including, as is my case, large binary files), I'm wondering if there's any performance or resource-usage side effects (or workarounds) when doing something like this:

INSERT INTO foo (c_uuid, file_data, file_name) VALUES (?, ?, ?)
ON CONFLICT (c_uuid) DO UPDATE SET file_data = ?, file_name = ?

On the Java side, that means invoking PreparedStatement::setBinaryStream TWICE for the same data.

I'm not noticing side effects on localhost (but haven't profiled either).  Is this fine, or a potential problem?   Is there a better approach?

So to summarize you would like some sort of optimization where we figured out that two of the values were the same in the initial insert and the on conflict and only send it once ?



To summarize, I'd like to avoid sending the same data twice, if that has any potentially significant performance issues and/or resource-usage issues.  I don't know if that's the case or not.

The SQL syntax (and the query workflow, and parameter order) is specific to PostgreSQL, so I'm not against something that varies from the basic "PreparedStatement" syntax.  If there's a safe way of figuring out the intention, that'd be excellent.  Maybe if the driver has "seen" the parameter set (via PreparedStatement::setXXX) in the INSERT part for field "X", and it "sees" the field "X" is unset in the ON CONFLICT part, or if you could have some sort of back-reference (either by index, e.g. $1, $2 or whatever, or by name if that's possible).


Re: Efficient use of ON CONFLICT DO UPDATE with the PostgreSQLJDBC driver

From
Brad DeJong
Date:

Christopher Brown wrote:

> ... I'd like to avoid sending the same data twice ...

>  ... when doing something like this:

>     INSERT INTO foo (c_uuid, file_data, file_name) VALUES (?, ?, ?)

>     ON CONFLICT (c_uuid) DO UPDATE SET file_data = ?, file_name = ?

> ... The SQL syntax (and the query workflow, and parameter order) is specific to PostgreSQL,

> so I'm not against something that varies from the basic "PreparedStatement" syntax.

> If there's a safe way of figuring out the intention, that'd be excellent. ...

 

I find it easiest to make the intention explicit like this. This is also a more conventient syntax for handling multiple rows because you just extend the "values (?, ?, ?)" to values "(?, ?, ?), (?, ?, ?), (?, ?, ?)".

 

    with params (c_uuid, file_data, file_name) as (values (?, ?, ?))

    insert into foo (c_uuid, file_data, file_name)

    select c_uuid, file_data, file_name from params

    on conflict (c_uuid) do update set (file_data, file_name) = (select file_data, file_name from params)

 

I ran this statement in 9.6 through pgAdmin 4 in order to verify the syntax (with the ?, ?, ? replaced with literals - I have not yet figured out how to bind parameters to parameterized statements in pgAdmin 4) and the messages were ...

 

    INSERT 0 1

 

    Query returned successfully in 182 msec.

 

 

 

 

 

Christopher Brown wrote:

> ... I'd like to avoid sending the same data twice ...

>  ... when doing something like this:

>     INSERT INTO foo (c_uuid, file_data, file_name) VALUES (?, ?, ?)

>     ON CONFLICT (c_uuid) DO UPDATE SET file_data = ?, file_name = ?

> ... The SQL syntax (and the query workflow, and parameter order) is specific to PostgreSQL,

> so I'm not against something that varies from the basic "PreparedStatement" syntax.

> If there's a safe way of figuring out the intention, that'd be excellent. ...

 

I find it easiest to make the intention explicit like this. This is also a more conventient syntax for handling multiple rows because you just extend the "values (?, ?, ?)" to values "(?, ?, ?), (?, ?, ?), (?, ?, ?)".

 

    with params (c_uuid, file_data, file_name) as (values (?, ?, ?))

    insert into foo (c_uuid, file_data, file_name)

    select c_uuid, file_data, file_name from params

    on conflict (c_uuid) do update set (file_data, file_name) = (select file_data, file_name from params)

 

I ran this statement in 9.6 through pgAdmin 4 in order to verify the syntax (with the ?, ?, ? replaced with literals - I have not yet figured out how to bind parameters to parameterized statements in pgAdmin 4) and the messages were ...

 

    INSERT 0 1

 

    Query returned successfully in 182 msec.

 

 

 

 

 

Christopher Brown wrote:

> ...

> I'm wondering if there's any performance or resource-usage side effects (or workarounds) when doing something like this:

>

>     INSERT INTO foo (c_uuid, file_data, file_name) VALUES (?, ?, ?)

>    ON CONFLICT (c_uuid) DO UPDATE SET file_data = ?, file_name = ?

>

> On the Java side, that means invoking PreparedStatement::setBinaryStream TWICE for the same data. ...

 

Yes, there are lots of resource usage side effects.

 

pgjdbc transmits the byte stream over the network once for every time the stream is bound.

 

You will also need to provide independent InputStream values (or use mark()/reset()) because setBinaryStream() consumes the stream.

The nice thing with pgjdbc is that it consumes the stream at the time you make the setBinaryStream() call. "small" streams (50 KB or less)

are materialized in memory while larger streams are written to a temp file. The mark()/reset() trick doesn't work for JDBC drivers that

consume the stream lazily.

 

If your query uses simple protocol, the query is materialized as a StringBuilder/String in its entirety (including parameter replacement)

before the query is transmitted (when you execute the statement). Which, if you have large blob/clob values, can result in

OutOfMemoryError or running into the internal JVM limits on String/array size (normally Integer.MAX_VALUE).

 

 

 

 

 

Re: Efficient use of ON CONFLICT DO UPDATE with the PostgreSQLJDBC driver

From
Brad DeJong
Date:

Christopher Brown wrote:

> ...

> I'm wondering if there's any performance or resource-usage side effects (or workarounds) when doing something like this:

>

>     INSERT INTO foo (c_uuid, file_data, file_name) VALUES (?, ?, ?)

>    ON CONFLICT (c_uuid) DO UPDATE SET file_data = ?, file_name = ?

>

> On the Java side, that means invoking PreparedStatement::setBinaryStream TWICE for the same data. ...

 

Yes, there are lots of resource usage side effects.

 

pgjdbc transmits the byte stream over the network once for every time the stream is bound.

 

You will also need to provide independent InputStream values (or use mark()/reset()) because setBinaryStream() consumes the stream.

The nice thing with pgjdbc is that it consumes the stream at the time you make the setBinaryStream() call. "small" streams (50 KB or less)

are materialized in memory while larger streams are written to a temp file. The mark()/reset() trick doesn't work for JDBC drivers that

consume the stream lazily.

 

If your query uses simple protocol, the query is materialized as a StringBuilder/String in its entirety (including parameter replacement)

before the query is transmitted (when you execute the statement). Which, if you have large blob/clob values, can result in

OutOfMemoryError or running into the internal JVM limits on String/array size (normally Integer.MAX_VALUE).

 

 

 

 

 

Re: Efficient use of ON CONFLICT DO UPDATE with the PostgreSQL JDBCdriver

From
Thomas Kellerer
Date:
Brad DeJong schrieb am 13.01.2017 um 18:18:
> I find it easiest to make the intention explicit like this. This is
> also a more conventient syntax for handling multiple rows because you
> just extend the "values (?, ?, ?)" to values "(?, ?, ?), (?, ?, ?),
> (?, ?, ?)".
>
>     with params (c_uuid, file_data, file_name) as (values (?, ?, ?))
>     insert into foo (c_uuid, file_data, file_name)
>     select c_uuid, file_data, file_name from params
>     on conflict (c_uuid) do update set (file_data, file_name) = (select file_data, file_name from params)
>
>
> I ran this statement in 9.6 through pgAdmin 4 in order to verify the
> syntax (with the ?, ?, ? replaced with literals - I have not yet
> figured out how to bind parameters to parameterized statements in
> pgAdmin 4) and the messages were ...

That is overly complicated because you can do a simple:

INSERT INTO foo (c_uuid, file_data, file_name) VALUES (?, ?, ?)
ON CONFLICT (c_uuid) DO
   UPDATE SET file_data = excluded.file_data,
              file_name = excluded.file_name;

Plus: using EXCLUDED means this will also work for multi-row inserts which your workaround does not:


INSERT INTO foo (c_uuid, file_data, file_name) VALUES (?, ?, ?), (?, ?, ?), (?, ?, ?)
ON CONFLICT (c_uuid) DO
   UPDATE SET file_data = excluded.file_data,
              file_name = excluded.file_name;


[JDBC] Re: Efficient use of ON CONFLICT DO UPDATE with the PostgreSQL JDBCdriver

From
Thomas Kellerer
Date:
Brad DeJong schrieb am 13.01.2017 um 18:18:
> I find it easiest to make the intention explicit like this. This is
> also a more conventient syntax for handling multiple rows because you
> just extend the "values (?, ?, ?)" to values "(?, ?, ?), (?, ?, ?),
> (?, ?, ?)".
>
>     with params (c_uuid, file_data, file_name) as (values (?, ?, ?))
>     insert into foo (c_uuid, file_data, file_name)
>     select c_uuid, file_data, file_name from params
>     on conflict (c_uuid) do update set (file_data, file_name) = (select file_data, file_name from params)
>
>
> I ran this statement in 9.6 through pgAdmin 4 in order to verify the
> syntax (with the ?, ?, ? replaced with literals - I have not yet
> figured out how to bind parameters to parameterized statements in
> pgAdmin 4) and the messages were ...

That is overly complicated because you can do a simple:

INSERT INTO foo (c_uuid, file_data, file_name) VALUES (?, ?, ?)
ON CONFLICT (c_uuid) DO
   UPDATE SET file_data = excluded.file_data,
              file_name = excluded.file_name;

Plus: using EXCLUDED means this will also work for multi-row inserts which your workaround does not:


INSERT INTO foo (c_uuid, file_data, file_name) VALUES (?, ?, ?), (?, ?, ?), (?, ?, ?)
ON CONFLICT (c_uuid) DO
   UPDATE SET file_data = excluded.file_data,
              file_name = excluded.file_name;