Re: [HACKERS] MERGE SQL Statement for PG11 - Mailing list pgsql-hackers
From | Nico Williams |
---|---|
Subject | Re: [HACKERS] MERGE SQL Statement for PG11 |
Date | |
Msg-id | 20171107232920.GZ4496@localhost Whole thread Raw |
In response to | Re: [HACKERS] MERGE SQL Statement for PG11 (Peter Geoghegan <pg@bowt.ie>) |
Responses |
Re: [HACKERS] MERGE SQL Statement for PG11
|
List | pgsql-hackers |
On Thu, Nov 02, 2017 at 03:25:48PM -0700, Peter Geoghegan wrote: > Nico Williams <nico@cryptonector.com> wrote: > >A MERGE mapped to a DML like this: I needed to spend more time reading MERGE docs from other RDBMSes. The best MERGE so far is MS SQL Server's, which looks like: MERGE INTO <target> <target_alias> USING <source> <source_alias> ON (<join condition>) -- optional: WHEN MATCHED THEN UPDATESET ... -- optional: WHEN NOT MATCHED [ BY TARGET ] THEN INSERT ... -- optional: WHEN NOT MATCHED BY SOURCE THEN DELETE -- optional: OUTPUT ... ; (The other MERGEs are harder to use because they lack a WHEN NOT MATCHED BY SOURCE THEN DELETE, instead having a DELETE clause on the UPDATE, which is then difficult to use.) This is *trivial* to map to a CTE, and, in fact, I and my colleagues have resorted to hand-coded CTEs like this precisely because PG lacks MERGE (though we ourselves didn't know about MERGE -- it's new to us). If <source> is a query, then we start with a CTE for that, else if it's a view or table, then we don't setup a CTE for it. Each of the UPDATE, INSERT, and/or DELETE can be it's own CTE. If there's an OUTPUT clause, that can be a final SELECT that queries from the CTEs that ran the DMLs with RETURNING. If there's no OUTPUT then none of the DMLs need to have RETURNING, and one of them will be the main statement, rather than a CTE. The pattern is: WITH -- IFF <source> is a query: <source_alias> AS (<source>), -- IFF there's a WHEN MATCHED THEN UPDATE updates AS ( UPDATE <target> AS <target_alias> SET ... FROM <source> WHERE <join_condition> -- IFF there's an OUTPUT clause, then: RETURNING 'update' as "@action", ... ), inserts AS ( INSERT INTO <target> (<column_list>) SELECT ... FROM <source> LEFT JOIN <target> ON <join_condition> WHERE <target> IS NOT DISTINCT FROM NULL -- IFF there's a CONCURRENTLY clause: ON CONFLICT DONOTHING -- IFF there's an OUTPUT clause, then: RETURNING 'insert' as "@action", ... ), deletes AS ( DELETE FROM <target> WHERE NOT EXISTS (SELECT * FROM <source> WHERE <join_condition>) -- IFFthere's an OUTPUT clause, then: RETURNING 'delete' as "@action", ... ), -- IFF there's an OUTPUT clause SELECT * FROM updates UNION SELECT * FROM inserts UNION SELECT * FROM deletes; If there's not an output clause then one of the DMLs has to be the main statement: WITH ... DELETE ...; -- or UPDATE, or INSERT Note that if the source is a view or table and there's no OUTPUT clause, then it's one DML with up to (but not referring to) two CTEs, and in all cases the CTEs do not refer to each other. This means that the executor can parallelize all of the DMLs. If the source is a query, then that could be made a temp view to avoid having to run the query first. The CTE executor needs to learn to sometimes do this anyways, so this is good. The <deletes> CTE can be equivalently written without a NOT EXISTS: to_be_deleted AS ( SELECT <target> FROM <target> LEFT JOIN <source> ON (<join_condition>) WHERE <source>IS NOT DISTINCT FROM NULL ), deletes AS ( DELETE FROM <target> USING to_be_deleted tbd WHERE <target>= <tbd> ) if that were to run faster (probably not, since PG today would first run the to_be_deleted CTE, then the deletes CTE). I mention only because it's nice to see the symmetry of LEFT JOINs for the two WHEN NOT MATCHED cases. (Here <source> is the alias for it if one was given.) *** This mapping triggers triggers as one would expect (at least FOR EACH ROW; I expect the DMLs in CTEs should also trigger FOR EACH STATEMENT triggers, and if they don't I consider that a bug). > This is a bad idea. An implementation like this is not at all > maintainable. I beg to differ. First of all, not having to add an executor for MERGE is a win: much, much less code to maintain. The code to map MERGE to CTEs can easily be contained entirely in src/backend/parser/gram.y, which is a maintainability win: any changes to how CTEs are compiled will fail to compile if they break the MERGE mapping to CTEs. > >can handle concurrency via ON CONFLICT DO NOTHING in the INSERT CTE. > > That's not handling concurrency -- it's silently ignoring an error. Who > is to say that the conflict that IGNORE ignored is associated with a row > visible to the MVCC snapshot of the statement? IOW, why should the DELETE > affect any row? That was me misunderstanding MERGE. The DELETE is independent of the INSERT -- if an INSERT does nothing because of an ON CONFLICT DO NOTHING clause, then that won't cause that row to be deleted -- the inserts and deletes CTEs are independent in the latest mapping (see above). I believe adding ON CONFLICT DO NOTHING to the INSERT in this mapping is all that's needed to support concurrency. > There are probably a great many reasons why you need a ModifyTable > executor node that keeps around state, and explicitly indicates that a > MERGE is a MERGE. For example, we'll probably want statement level > triggers to execute in a fixed order, regardless of the MERGE, RLS will > probably require explicitly knowledge of MERGE semantics, and so on. Let's take those examples one at a time: - Is there a reason to believe that MERGE could not parallelize the DMLs it implies? If they can be parallelized, then we should not define the order in which the corresponding triggers fire. Surely we want to leave that possibility (parallelization) open, rather than exclude it. The user should not depend on the order in which the FOR EACH STATEMENT and FOR EACH ROW triggers will fire. They canalways check at the end of the transaction with DEFERRED triggers (see also my patch for ALWAYS DEFERRED constraintsand triggers). AFTER <op> FOR EACH STATEMENT triggers will only run after all the corresponding DMLs in the mapping have completed, buttheir relative orders should still not be defined. - I don't see how RLS isn't entirely orthogonal. RLS would (does) apply as normal to all of the DMLs in the mapping. If that was not the case, then there'd be a serious bug in PG right now! Using a CTE must *not* disable RLS. FOR UPDATE RLS policies are broken, however, since they don't get to see the OLD and NEW values. But that's orthogonalhere. > FWIW, your example doesn't actually have a source (just a target), so it > isn't actually like MERGE. That was my mistake -- as I say above, I had to spend more time with the various RDBMSes' MERGE docs. Nico -- -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
pgsql-hackers by date: