Re: New addition to the merge sql standard - Mailing list pgsql-general

From Nick DeCoursin
Subject Re: New addition to the merge sql standard
Date
Msg-id CAAbvdHnsLhpDaBV5mxDcjcfhgmiVUFSQBZPaM=uGRP6ScThW_Q@mail.gmail.com
Whole thread Raw
In response to Re: New addition to the merge sql standard  (Alvaro Herrera <alvherre@alvh.no-ip.org>)
List pgsql-general
> "ignore" may not be what you want, though.  Perhaps the fact that insert
> (coming from the NOT MATCHED clause) fails (== conflicts with a tuple
> concurrently inserted in an unique or exclusion constraint) should
> transform the row operation into a MATCHED case, so it'd fire the other
> clauses in the overall MERGE operation.  Then you could add a WHEN
> MATCHED DO NOTHING case which does the ignoring that you want; or just
> let them be handled by WHEN MATCHED UPDATE or whatever.  But you may
> need some way to distinguish rows that appeared concurrently from rows
> that were there all along.

Your suggestion (or idea) would also be good with me, but that would seem to me to be considerably more complex to implement and to use, I think.

In a similar sense, I think perhaps it could be good to add the complete `insert ... on conflict` syntax and functionality onto the insert part of merge (including handling unique violations with an update), but I haven't completely thought it through. This is similar to what you seemed to be brainstorming, but rather than going back into the match statements, it would look like this perhaps (with the new additions underlined):

merge into some_table st
using different_table as dt
on (st.id = dt.st_id)
when matched do nothing
when not matched then
  insert (cola, colb, colc)
  values ("uniquekey", "gal", "asfa")
  on conflict (cola) do update set
  colb = "gal",
  colc = "asfa"
;


As a result, `merge` would essentially have all the functionality that `insert ... on conflict` has, so that perhaps (perhaps!) `merge` >= 'insert' in every case except for simply inserting raw data without a select.

I personally think adding the `on conflict ...` syntax onto merge insert like ^ this would be better than trying to go back into the merge's match statements.

However, even if it were desired to add the complete `on conflict do update ...` functionality, I would think it would be better to take one step at a time by just first adding the ignore functionality (of ignoring rows that cause unique violation exceptions) which could be called `on conflict do nothing`, leaving open the possibility for future expansion with the `on conflict do update set ...`.

> In regards to the SQL standard, I hope what you're saying is merely not
> documented by them.  If it indeed isn't, it may be possible to get them
> to accept some new behavior, and then I'm sure we'd consider
> implementing it.  If your suggestion goes against what they already
> have, I'm afraid you'd be doomed.  So the next question is, how do other
> implementations handle this case you're talking about?  SQL Server, DB2
> and Oracle being the relevant ones.

I would suppose it's merely not documented but yet that there's nothing contradictory to it either. Unfortunately, I don't know how other databases do it, but I would think probably none of them do it yet.

> Assuming the idea is good and there are no conflicts, then maybe it's
> just lack of round tuits.

I think it's just an advance feature that's hard to grasp, understand, and develop. But as our database systems become more and more advanced, advance features like this become more and more feasible.

You know, Postgres is sort of reaching that point - I believe - where it's looking around, like, ok what next? We've done everything, while other databases are still struggling under structural problems, postgres is starting to "push the envelope".

> I vaguely recall thinking about this, and noticing that implementing
> something of this sort would require messing around with the ExecInsert
> interface.  It'd probably require splitting it in pieces, similar to how
> ExecUpdate was split.
>
> There are some comments in the code about possible "live-locks" where
> merge would be eternally confused between inserting a new row which it
> then wants to delete; or something like that.  For sure we would need to
> understand the concurrent behavior of this new feature very clearly.
>
> An interesting point is that our inserts *wait* to see whether the
> concurrent insertion commits or aborts, when a unique constraint is
> involved.  I'm not sure you want to have MERGE blocking on concurrent
> inserts.  This is all assuming READ COMMITTED semantics; on REPEATABLE
> READ or higher, I think you're just screwed, because of course MERGE is
> not going to get a snapshot that sees the rows inserted by transactions
> that started after.
>
> You'd need to explore all this very carefully.

Yeah. I would have liked to have been able to volunteer to do it myself, but it seems based on your description to be way over my head 😄. C is definitely not my forte, and I don't have any postgres development experience (actually we did go into the postgres source for a database project in college haha).

However, it might be pertinent for me to reference this under the READ COMMITTED semantics:

MERGE allows the user to specify various combinations of INSERTUPDATE and DELETE subcommands. A MERGE command with both INSERT and UPDATE subcommands looks similar to INSERT with an ON CONFLICT DO UPDATE clause but does not guarantee that either INSERT or UPDATE will occur. If MERGE attempts an UPDATE or DELETE and the row is concurrently updated but the join condition still passes for the current target and the current source tuple, then MERGE will behave the same as the UPDATE or DELETE commands and perform its action on the updated version of the row. However, because MERGE can specify several actions and they can be conditional, the conditions for each action are re-evaluated on the updated version of the row, starting from the first action, even if the action that had originally matched appears later in the list of actions. On the other hand, if the row is concurrently updated or deleted so that the join condition fails, then MERGE will evaluate the condition's NOT MATCHED actions next, and execute the first one that succeeds. If MERGE attempts an INSERT and a unique index is present and a duplicate row is concurrently inserted, then a uniqueness violation error is raised; MERGE does not attempt to avoid such errors by restarting evaluation of MATCHED conditions.

With this in mind, the `merge` statement doesn't block on concurrent inserts, nor is that necessary. The merge semantics imply that there is no blocking/waiting. Deriving from this or in tandem with this, the insert within merge doesn't need to do any blocking or waiting either, only when it actually performs the job of committing the insert, if this fails, then perform the `on conflict do nothing`. Therefore, due to the original merge semantics, merge insert doesn't need to follow the same requirements as normal `insert`, and it doesn't need to *wait* - it's a best effort thing.

In my opinion, `merge` is meant for batch operations of large data, and that's the best way to think about it. It's not meant for perfectly serializable data. It's meant for moving huge datasets efficiently in a best effort means. 

Cheers,
Nick

On Thu, Nov 16, 2023 at 6:13 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
On 2023-Nov-16, Nick DeCoursin wrote:

> In my opinion, it would be better for merge to offer the functionality to
> simply ignore the rows that cause unique violation exceptions instead of
> tanking the whole query.

"ignore" may not be what you want, though.  Perhaps the fact that insert
(coming from the NOT MATCHED clause) fails (== conflicts with a tuple
concurrently inserted in an unique or exclusion constraint) should
transform the row operation into a MATCHED case, so it'd fire the other
clauses in the overall MERGE operation.  Then you could add a WHEN
MATCHED DO NOTHING case which does the ignoring that you want; or just
let them be handled by WHEN MATCHED UPDATE or whatever.  But you may
need some way to distinguish rows that appeared concurrently from rows
that were there all along.

In regards to the SQL standard, I hope what you're saying is merely not
documented by them.  If it indeed isn't, it may be possible to get them
to accept some new behavior, and then I'm sure we'd consider
implementing it.  If your suggestion goes against what they already
have, I'm afraid you'd be doomed.  So the next question is, how do other
implementations handle this case you're talking about?  SQL Server, DB2
and Oracle being the relevant ones.

Assuming the idea is good and there are no conflicts, then maybe it's
just lack of round tuits.

Happen to have some?

I vaguely recall thinking about this, and noticing that implementing
something of this sort would require messing around with the ExecInsert
interface.  It'd probably require splitting it in pieces, similar to how
ExecUpdate was split.

There are some comments in the code about possible "live-locks" where
merge would be eternally confused between inserting a new row which it
then wants to delete; or something like that.  For sure we would need to
understand the concurrent behavior of this new feature very clearly.


An interesting point is that our inserts *wait* to see whether the
concurrent insertion commits or aborts, when a unique constraint is
involved.  I'm not sure you want to have MERGE blocking on concurrent
inserts.  This is all assuming READ COMMITTED semantics; on REPEATABLE
READ or higher, I think you're just screwed, because of course MERGE is
not going to get a snapshot that sees the rows inserted by transactions
that started after.

You'd need to explore all this very carefully.

--
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/

pgsql-general by date:

Previous
From: Christoph Moench-Tegeder
Date:
Subject: Re: pg_basebackup
Next
From: David Gauthier
Date:
Subject: Can user specification of a column value be required when querying a view ?