Re: [EXTERNAL]: Re: [EXTERNAL]: Re: UPSERT in Postgres - Mailing list pgsql-general

From Louis Tian
Subject Re: [EXTERNAL]: Re: [EXTERNAL]: Re: UPSERT in Postgres
Date
Msg-id SY4P282MB10528232872C012334DEEA30A6949@SY4P282MB1052.AUSP282.PROD.OUTLOOK.COM
Whole thread Raw
In response to Re: [EXTERNAL]: Re: UPSERT in Postgres  (Israel Brewster <ijbrewster@alaska.edu>)
List pgsql-general



From: Israel Brewster <ijbrewster@alaska.edu>
Sent: Sunday, April 9, 2023 3:09 AM
To: Louis Tian <louis.tian@aquamonix.com.au>
Cc: Peter Geoghegan <pg@bowt.ie>; pgsql-general@lists.postgresql.org <pgsql-general@lists.postgresql.org>
Subject: [EXTERNAL]: Re: [EXTERNAL]: Re: UPSERT in Postgres
 
Thanks Israel. Your example really helped me to understand where we differ. 

Not necessarily. Consider the following UPSERT statement:
> INSERT INTO customer_order_counts (customer_id,order_count) VALUES (123456, 1) ON CONFLICT (customer_id) DO UPDATE SET order_count=customer_order_counts.order_count+1;
 
Yes, you are 100% right that is not idempotent statement. But that's not because UPSERT is idempotent ​*operation*. I am making a distinction between an operation and statement here. 
This statement is not idempotent is not because UPSERT operation is not idempotent, it is because the value being upsert is different in the statement each time you run that statement. 

Not really. Generally when I am doing an UPSERT, I am NOT using the primary key, but rather some other UNIQUE-ly indexed column(s). My primary key is typically an ID column that is defined as a serial, > and automatically generated by the database. The unique column I use for the upset, however, is generally something that would identify the row to a human - such as Station ID and timestamp 
> columns in a database I have of seismic readings. Each reading gets a unique ID (the primary key) that identifies it to the database and is used in joins. However, occasionally a reading gets updated, so I > update the row, based not on the primary key (which I don’t know for incoming data), but on the station and timestamp. UPSERT, but not based on the primary key.

I am using "primary key" to refer to the "row identifier" in a conceptual way. 
Definition from Wiki, "In the relational model of databases, a primary key is a specific choice of a minimal set of attributes (columns) that uniquely specify a tuple (row) in a relation (table)." 
So, it's not necessarily the primary key you declared on the table (for example, the id column in your example). 
To put it in another way, the Station ID and timestamp columns work in the capability of a primary key.
I was merely setting a scene in my orginal email. What I was trying to say is "UPSERT only makes sense when each row is uniquely identifiable".
So please don't get me wrong there, I am not arguing that being able to do on conflict on any unique index is wrong or bad. 

Cheers,
Louis Tian

> On Apr 6, 2023, at 4:49 PM, Louis Tian <louis.tian@aquamonix.com.au> wrote:
>
> Hi Peter,
>
> Thanks for your reply. Appreciate the help and discussion.
>
>>  In general UPSERT (or any definition of it that I can think of) does
>> not imply idempotency.
>
> "Idempotence is the property of certain operations in mathematics and computer science whereby they can be applied multiple times without changing the result beyond the initial application." from Wikipedia.
> the concept of Idempotence when applies to HTTP is consistent with the above. https://developer.mozilla.org/en-US/docs/Glossary/Idempotent. Or are you going by a different defintion that I am not aware of?
> If you execute the same upsert multiple times, the state of the database remains the same as if only execute once.

Not necessarily. Consider the following UPSERT statement:

INSERT INTO customer_order_counts (customer_id,order_count) VALUES (123456, 1) ON CONFLICT (customer_id) DO UPDATE SET order_count=customer_order_counts.order_count+1;

That is completely valid (I tested it), and actually makes sense as something you might want to do - keep track of how many orders a customer has placed, for example if you only keep the order records for 6 months, but still want to know the total number of orders the customer has placed. If it is a new customer, you insert a record for the customer with an order count of 1. Otherwise, you update the record to increment the order count. Clearly this is NOT an idempotent operation - every time you run it, it changes the order count, so the state of the database does NOT remain the same as if you only execute it once.

> If a row already exists, the first statement will update the row so does any subsequent statements. executing the same update multiple time is the same as executing it only once.
> If the row doesn't exist, the first statement will insert that row and any subsequent will try to update, but the update has no real effect since it the value is exactly the same as the insert.
> So by defintion, upsert is idempotent.
>
>> It could just be a unique index or a unique constraint. So you can
>> upsert on any individual unique constraint/index, or the primary key.
>> Of course there might be several on a given table, but you can only
>> use one as the "conflict arbiter" per statement.
>
> Understand that I can use any unique constraint with on conflict.
> But semantically the only correct one is the primary key, since that's what identifies a row logically.

Not really. Generally when I am doing an UPSERT, I am NOT using the primary key, but rather some other UNIQUE-ly indexed column(s). My primary key is typically an ID column that is defined as a serial, and automatically generated by the database. The unique column I use for the upset, however, is generally something that would identify the row to a human - such as Station ID and timestamp  columns in a database I have of seismic readings. Each reading gets a unique ID (the primary key) that identifies it to the database and is used in joins. However, occasionally a reading gets updated, so I update the row, based not on the primary key (which I don’t know for incoming data), but on the station and timestamp. UPSERT, but not based on the primary key.

---
Israel Brewster
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145

> In that sense, any unique column(s) is a potential candidate for primary key.
> It's more of a pedantic point rather than pragmatic one.
> It's less of a problem for PostgreSQL where the semantic importance of primary key is not manifested at implementation level, since all index points to the tuple directly
> Whereas it is more import for Databaes like MySQL where the secondary index points to the primary key index.
--
Message  protected by MailGuard: e-mail anti-virus, anti-spam and content filtering.https://www.mailguard.com.au/mg
Click here to report this message as spam:
https://console.mailguard.com.au/ras/25ah5yO8hL/7ACQJux9EWGExcgeAdX4X1/5.6

pgsql-general by date:

Previous
From: thayanban thay
Date:
Subject: Postgresql Upgrade from 10 to 14
Next
From: Louis Tian
Date:
Subject: Re: [EXTERNAL]: Re: [EXTERNAL]: Re: UPSERT in Postgres