Thread: Issues with upserts

Issues with upserts

From
André Hänsel
Date:

The wiki entry https://wiki.postgresql.org/wiki/UPSERT kind of suggests that with the introduction of ON CONFLICT DO UPDATE the problem of upserts is solved. But is it?

 

A common use case for upserts is to keep a table up to date from an external data source. So you might have a cron job that runs a relatively large batch of upserts every couple of minutes.

 

I have found that this use case is not adequately covered by ON CONFLICT DO UPDATE for two reasons:

- New versions are created for all rows, even if the data is identical. This quickly fills up the WAL and puts unnecessary load on the tablespace drives.

- If the conflict target is not the serial column, the sequence backing the serial column gets incremented for every row. This quickly reaches the point where the serial for new rows exceeds the range of an integer.

 

Does this mean I have to SELECT the data first and do the conflict check in the application or is there a better SQL-only way?

Re: Issues with upserts

From
Jeremy Smith
Date:



> - New versions are created for all rows, even if the data is identical. This quickly fills up the WAL and puts unnecessary load on the tablespace drives.
>
> - If the conflict target is not the serial column, the sequence backing the serial column gets incremented for every row. This quickly reaches the point where the serial for new rows exceeds the range of an integer.

It sounds like you aren't adding a WHERE clause to prevent the duplicate rows from being updated.  It would help if you could share your query, but in general this could look like this:

INSERT INTO my_table (col1, col2)
SELECT col1, col2 FROM other_table
ON CONFLICT (col1, col2) DO UPDATE SET (col1, col2) = (EXCLUDED.col1, EXCLUDED.col2)
WHERE (my_table.col1, my_table.col2) IS DISTINCT FROM (EXCLUDED.col1, EXCLUDED.col2);
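The WHERE clause here is the key trick: when the incoming row is identical, the UPDATE is skipped entirely, so no new row version (and no WAL) is written for unchanged data. A rough, runnable sketch of the same pattern, using Python's built-in sqlite3 as a stand-in for PostgreSQL (SQLite's upsert syntax has no row-wise IS DISTINCT FROM, so the change check becomes a per-column NULL-safe IS NOT; the table and column names are illustrative only):

```python
import sqlite3

# Sketch of the "skip no-op updates" upsert pattern, with SQLite
# standing in for PostgreSQL. Names are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE t (
        name    TEXT PRIMARY KEY,
        address TEXT NOT NULL
    )
""")

def upsert(name, address):
    conn.execute(
        """
        INSERT INTO t (name, address) VALUES (?, ?)
        ON CONFLICT (name) DO UPDATE SET address = excluded.address
        WHERE t.address IS NOT excluded.address
        """,
        (name, address),
    )

upsert("foo", "Baker street")   # first run: plain insert
upsert("foo", "Baker street")   # identical data: the WHERE skips the update
upsert("bar", "Miller street")
```

Re-running the same batch leaves the table untouched instead of rewriting every row, which is the behavior the WHERE clause buys you on the PostgreSQL side as well.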

Re: Issues with upserts

From
"David G. Johnston"
Date:
On Wednesday, July 13, 2022, André Hänsel <andre@webkr.de> wrote:

> The wiki entry https://wiki.postgresql.org/wiki/UPSERT kind of suggests that with the introduction of ON CONFLICT DO UPDATE the problem of upserts is solved. But is it?
>
> A common use case for upserts is to keep a table up to date from an external data source. So you might have a cron job that runs a relatively large batch of upserts every couple of minutes.
>
> I have found that this use case is not adequately covered by ON CONFLICT DO UPDATE for two reasons:
>
> - New versions are created for all rows, even if the data is identical. This quickly fills up the WAL and puts unnecessary load on the tablespace drives.
>
> - If the conflict target is not the serial column, the sequence backing the serial column gets incremented for every row. This quickly reaches the point where the serial for new rows exceeds the range of an integer.
>
> Does this mean I have to SELECT the data first and do the conflict check in the application or is there a better SQL-only way?


Well, first of all, don’t use serial/auto-increment in this situation.

Second, you need to figure out what the unique key in the provided data is and key off of that for detecting duplicates.  Ideally it already contains some bigint key column which then makes the first point easy to accept.

David J.
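A hedged sketch of what David is suggesting, again using sqlite3 as a stand-in: the table is keyed on the identifier the external feed already provides, with no serial surrogate at all, so repeated upserts cannot burn sequence values. The feed layout (`ext_id`, `name`) is invented for illustration:

```python
import sqlite3

# Key the table on the feed's own 64-bit identifier instead of a
# serial surrogate; no sequence is involved, so nothing is burned
# on conflicting rows. Names and data are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customers (
        ext_id INTEGER PRIMARY KEY,  -- id supplied by the external feed
        name   TEXT NOT NULL
    )
""")

feed = [(1001, "foo"), (1002, "bar")]

def sync(rows):
    conn.executemany(
        """
        INSERT INTO customers (ext_id, name) VALUES (?, ?)
        ON CONFLICT (ext_id) DO UPDATE SET name = excluded.name
        WHERE customers.name IS NOT excluded.name
        """,
        rows,
    )

sync(feed)   # initial load
sync(feed)   # re-running the identical batch changes nothing
```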
 

RE: Issues with upserts

From
André Hänsel
Date:

Jeremy Smith wrote:

 

> It sounds like you aren't adding a WHERE clause to prevent the duplicate rows from being updated.  It would help if you could share your query, but in general this could look like this:
>
> INSERT INTO my_table (col1, col2)
> SELECT col1, col2 FROM other_table
> ON CONFLICT (col1, col2) DO UPDATE SET (col1, col2) = (EXCLUDED.col1, EXCLUDED.col2)
> WHERE (my_table.col1, my_table.col2) IS DISTINCT FROM (EXCLUDED.col1, EXCLUDED.col2)

 

Here’s an example:

 

https://dbfiddle.uk/?rdbms=postgres_14&fiddle=b48d062d2eedbab14157359694b16081

 

CREATE TABLE t (
    id serial PRIMARY KEY,
    name text NOT NULL UNIQUE,
    address text NOT NULL
);

INSERT INTO t(name, address)
VALUES ('foo', 'Baker street'), ('bar', 'Miller street')
ON CONFLICT (name) DO UPDATE SET address = EXCLUDED.address
WHERE (t.name, t.address) IS DISTINCT FROM (EXCLUDED.name, EXCLUDED.address);

INSERT INTO t(name, address)
VALUES ('foo', 'Baker street'), ('bar', 'Miller street')
ON CONFLICT (name) DO UPDATE SET address = EXCLUDED.address
WHERE (t.name, t.address) IS DISTINCT FROM (EXCLUDED.name, EXCLUDED.address);

INSERT INTO t(name, address)
VALUES ('foo', 'Baker street'), ('bar', 'Miller street')
ON CONFLICT (name) DO UPDATE SET address = EXCLUDED.address
WHERE (t.name, t.address) IS DISTINCT FROM (EXCLUDED.name, EXCLUDED.address);

INSERT INTO t(name, address)
VALUES ('foo', 'Baker street'), ('bar', 'Miller street')
ON CONFLICT (name) DO UPDATE SET address = EXCLUDED.address
WHERE (t.name, t.address) IS DISTINCT FROM (EXCLUDED.name, EXCLUDED.address);

SELECT last_value FROM t_id_seq;

 

This will yield “8”, showing that new sequence numbers have been generated for each attempt.

Re: Issues with upserts

From
Tom Lane
Date:
André Hänsel <andre@webkr.de> writes:
> This will yield “8”, showing that new sequence numbers have been generated for each attempt.

Well, yeah, because the INSERT has to be attempted first, and
that includes forming the whole candidate row including the
nextval() result.  If you're expecting a serial ID column to not
have holes in the sequence of values, you're going to be sadly
disappointed, whether you use ON CONFLICT UPDATE or not.

            regards, tom lane



Re: Issues with upserts

From
"David G. Johnston"
Date:
On Wed, Jul 13, 2022 at 7:58 AM André Hänsel <andre@webkr.de> wrote:


> SELECT last_value FROM t_id_seq;
>
> This will yield “8”, showing that new sequence numbers have been generated for each attempt.


Yep, an entire able-to-be-inserted tuple is formed every time.  That requires evaluating defaults so that every column has a value.

David J.

Re: Issues with upserts

From
Adrian Klaver
Date:
On 7/13/22 07:58, André Hänsel wrote:
> Jeremy Smith wrote:

> CREATE TABLE t (
>     id serial PRIMARY KEY,
>     name text NOT NULL UNIQUE,
>     address text NOT NULL
> );
>
> This will yield “8”, showing that new sequence numbers have been
> generated for each attempt.

If running out of ids is a concern, use bigserial instead of serial, as it is backed by bigint:

bigint (backing bigserial):  -9223372036854775808 to +9223372036854775807

vs

integer (backing serial):  -2147483648 to +2147483647
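To put those ranges in perspective, here is a back-of-the-envelope headroom calculation, assuming every attempted row burns one sequence value (the worst case described upthread); the batch size and schedule are invented numbers:

```python
# Headroom estimate: how long until each sequence type wraps,
# assuming one burned value per attempted row. Batch size and
# schedule below are hypothetical.
int4_max = 2**31 - 1   # integer, backing serial
int8_max = 2**63 - 1   # bigint, backing bigserial

rows_per_run = 1_000_000        # hypothetical batch size
runs_per_day = (24 * 60) // 5   # hypothetical: every 5 minutes
burned_per_day = rows_per_run * runs_per_day

days_until_int4_wraps = int4_max // burned_per_day           # about a week
years_until_int8_wraps = int8_max // (burned_per_day * 365)  # effectively never
```

Under those (made-up) numbers a plain serial is exhausted in days, while bigserial has millions of years of headroom, which is why switching the backing type makes the burned values a non-issue.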

-- 
Adrian Klaver
adrian.klaver@aklaver.com