Thread: Guidance on INSERT RETURNING order
Hello list, I have a few clarification questions regarding using insert with returning. The use case is SQLAlchemy development, where the orm wants to insert a list of rows, get back the generated ids, defaults, etc, and match the returned values with the original list of orm objects. The following assumes a table like this CREATE TABLE t( id SERIAL, data TEXT -- type here can be anything ) On PostgreSQL, since SQLAlchemy version 1.4 (Mar 2021) this operation has used INSERT INTO t(data) VALUES ('a'), ('b'), ('c') RETURNING id but we were recently made aware that there is no guarantee on the order of the returned columns. Looking at the documentation there is no mention of the order of the RETURNING clause, but searching past discussion there are multiple indication that the order is not guaranteed, like https://www.postgresql.org/message-id/19445.1350482182%40sss.pgh.pa.us . I think the docs should mention this, similar to what the sqlite docs do at https://www.sqlite.org/lang_returning.html#limitations_and_caveats Searching the archive seems that a using the INSERT SELECT ORDER BY form should be a better solution, so the above insert should be rewritten as INSERT INTO t(data) SELECT data FROM (VALUES ('a', 1), ('b', 2), ('c', 3)) as vv(data, num) ORDER BY num RETURNING id to ensure that the id are created in the order specified by num. The returned id can again be in arbitrary order, but sorting them should enable correctly matching the orm object so that they can be properly updated. Is this correct? The documentation does not say anything about this, and looking at the archive it seems that it's mostly correct but not 100% guaranteed, as stated here https://www.postgresql.org/message-id/29386.1528813619%40sss.pgh.pa.us . The MSSQL docs, for example, clearly state that this is the case https://learn.microsoft.com/en-us/sql/t-sql/statements/insert-transact-sql?view=sql-server-ver16#limitations-and-restrictions , so it would be helpful if something similar were mentioned in the PostgreSQL docs. The above insert form (INSERT SELECT ORDER BY) can be used when the primary key is an auto incrementing value, in case it isn't (such as when it's an UUID), another solution must be used. Since there does not seem to be any way of getting the position of the original row inside the VALUES clause with RETURNING, the solution SQLAlchemy is implementing is to either degrade to inserts with a single value or to optionally allow the user to add a "sentinel" column to the table, so that a sequential value can be inserted into it and then returned allowing the ordering of the RETURNING clause rows: ALTER TABLE t ADD COLUMN sentinel SMALLINT INSERT INTO t(data, sentinel) VALUES ('a', 1), ('b', 2), ('c', 3) RETURNING id, sentinel Is there any better solution to achieve this? (For reference this feature is tracked in SQLAlchemy by https://github.com/sqlalchemy/sqlalchemy/issues/9618) From an ORM standpoint it would be very useful having a way of forcing the order of RETURNING to be the same as the one in VALUES, maybe with an additional keyword. Alternatively having a system column or other function that can be placed into the returning clause to return the output row position wrt the input values list, similar to what the sentinel column above does. At the very least I think the documentation could do a better job at mentioning that RETURNING order is arbitrary, and documenting that INSERT SELECT ORDER BY precesses the default in select order (if that's indeed the case) Sorry for the long email, Thanks Federico
On 4/11/23 12:47, Federico wrote: > Hello list, > > I have a few clarification questions regarding using insert with returning. > The use case is SQLAlchemy development, where the orm wants to insert > a list of rows, > get back the generated ids, defaults, etc, and match the returned values with > the original list of orm objects. > > The following assumes a table like this > > CREATE TABLE t( > id SERIAL, > data TEXT -- type here can be anything > ) > > On PostgreSQL, since SQLAlchemy version 1.4 (Mar 2021) this operation has used > > INSERT INTO t(data) VALUES ('a'), ('b'), ('c') RETURNING id > > but we were recently made aware that there is no guarantee on the > order of the returned columns. 1) Because returned data in SQL is inherently unordered. 2) What would you order by, id or data or both? > Sorry for the long email, > Thanks > > Federico > > -- Adrian Klaver adrian.klaver@aklaver.com
On 4/11/23 12:47, Federico wrote: > Hello list, > https://www.sqlite.org/lang_returning.html#limitations_and_caveats > > Searching the archive seems that a using the INSERT SELECT ORDER BY > form should be a better solution, > so the above insert should be rewritten as > > INSERT INTO t(data) > SELECT data FROM (VALUES ('a', 1), ('b', 2), ('c', 3)) as vv(data, > num) ORDER BY num > RETURNING id Or with i as (INSERT INTO t(data) VALUES ('a', 1), ('b', 2), ('c', 3) returning id) select i.id from i order by id; > Sorry for the long email, > Thanks > > Federico > > -- Adrian Klaver adrian.klaver@aklaver.com
Στις 11/4/23 23:06, ο/η Adrian Klaver έγραψε: > On 4/11/23 12:47, Federico wrote: >> Hello list, >> > https://www.sqlite.org/lang_returning.html#limitations_and_caveats >> >> Searching the archive seems that a using the INSERT SELECT ORDER BY >> form should be a better solution, >> so the above insert should be rewritten as >> >> INSERT INTO t(data) >> SELECT data FROM (VALUES ('a', 1), ('b', 2), ('c', 3)) as vv(data, >> num) ORDER BY num >> RETURNING id > > Or > > with i as (INSERT INTO t(data) VALUES ('a', 1), ('b', 2), ('c', 3) > returning id) > select i.id from i order by id; +1 for this version! > >> Sorry for the long email, >> Thanks >> >> Federico >> >> > -- Achilleas Mantzios IT DEV - HEAD IT DEPT Dynacom Tankers Mgmt
Federico <cfederico87@gmail.com> writes: > Searching the archive seems that a using the INSERT SELECT ORDER BY > form should be a better solution, > so the above insert should be rewritten as > INSERT INTO t(data) > SELECT data FROM (VALUES ('a', 1), ('b', 2), ('c', 3)) as vv(data, > num) ORDER BY num > RETURNING id > to ensure that the id are created in the order specified by num. The > returned id can again be in > arbitrary order, but sorting them should enable correctly matching the > orm object so that they can > be properly updated. > Is this correct? No. Sadly, adding that ORDER BY is just voodoo programming, because it applies to the result of the SELECT while promising nothing about the order in which INSERT/RETURNING will act on those rows. Re-reading that 2012 thread, the main new observation I'd make today is that parallel operation is a thing now, and it's not hard to foresee that sometime soon we'll want to parallelize INSERTs. Which'd make it *really* hard to promise anything about the order of RETURNING output. I think if you want to use RETURNING with multi-row inserts, the thing to do is more like INSERT INTO t(data) VALUES ('a'), ('b'), ('c') RETURNING data, id and then explicitly match up the returned "data" values rather than presuming they appear in the same order you wrote them in in VALUES. Admittedly this might be problematic if some of the VALUES rows are identical, but how much should you care? regards, tom lane
Thanks for the ansers > 2) What would you order by, id or data or both? by values order, (that incidentally seems to be what PG does) > with i as (INSERT INTO t(data) VALUES ('a', 1), ('b', 2), ('c', 3) > returning id) > select i.id from i order by id; The problem here is not having the auto increment id in a particular order, is that there is apparently no correlation with the position of an element in the values clause with the id generated. That's the reason for using the sentinel column in the general solution in the previous message. The extend on the use case, SQLAlchemy has 3 objects T that have T(data='a'), T(data='b'), T(data='c') but no value for the id column. The objective is to insert the 3 data values, get back the ids and correctly match them with the correct 3 objects. > No. Sadly, adding that ORDER BY is just voodoo programming, because > it applies to the result of the SELECT while promising nothing about > the order in which INSERT/RETURNING will act on those rows. I wasn't probably clear, it's fine if INSERT/RETURNING order is arbitrary, what matters is that the autoincementing values is executed in the same order as select, like mentioned in this previous message https://www.postgresql.org/message-id/29386.1528813619%40sss.pgh.pa.us Is that not the case? > Re-reading that 2012 thread, the main new observation I'd make today > is that parallel operation is a thing now, and it's not hard to foresee > that sometime soon we'll want to parallelize INSERTs. Which'd make it > *really* hard to promise anything about the order of RETURNING output. I think it's fine not promising anything about the order of RETURNING, but it would be very helpful having a way of tracking what input row generated a particular output row. Basically the sentinel case in the original post, without actually having to insert the sentinel into the table. > I think if you want to use RETURNING with multi-row inserts, the > thing to do is more like > > INSERT INTO t(data) VALUES ('a'), ('b'), ('c') RETURNING data, id > > and then explicitly match up the returned "data" values rather than > presuming they appear in the same order you wrote them in in VALUES. > Admittedly this might be problematic if some of the VALUES rows > are identical, but how much should you care? Well, the example is very easy, but it's hard to generalize when inserting multiple columns with possible complex values in them, since it would mean matching on possibly large json values, arrays, etc. So definitely not ideal Thanks, Federico On Tue, 11 Apr 2023 at 22:06, Adrian Klaver <adrian.klaver@aklaver.com> wrote: > > On 4/11/23 12:47, Federico wrote: > > Hello list, > > > https://www.sqlite.org/lang_returning.html#limitations_and_caveats > > > > Searching the archive seems that a using the INSERT SELECT ORDER BY > > form should be a better solution, > > so the above insert should be rewritten as > > > > INSERT INTO t(data) > > SELECT data FROM (VALUES ('a', 1), ('b', 2), ('c', 3)) as vv(data, > > num) ORDER BY num > > RETURNING id > > Or > > with i as (INSERT INTO t(data) VALUES ('a', 1), ('b', 2), ('c', 3) > returning id) > select i.id from i order by id; > > > Sorry for the long email, > > Thanks > > > > Federico > > > > > > -- > Adrian Klaver > adrian.klaver@aklaver.com >
On Tue, 11 Apr 2023, Federico wrote: >The problem here is not having the auto increment id in a particular The id might not even be auto-increment but UUID or something… (I am surprised you would even try to insert multiple rows at once.) bye, //mirabilos -- 15:41⎜<Lo-lan-do:#fusionforge> Somebody write a testsuite for helloworld :-)
On Tue, Apr 11, 2023, at 4:22 PM, Tom Lane wrote:
Federico <cfederico87@gmail.com> writes:> Searching the archive seems that a using the INSERT SELECT ORDER BY> form should be a better solution,> so the above insert should be rewritten as> INSERT INTO t(data)> SELECT data FROM (VALUES ('a', 1), ('b', 2), ('c', 3)) as vv(data,> num) ORDER BY num> RETURNING id> to ensure that the id are created in the order specified by num. The> returned id can again be in> arbitrary order, but sorting them should enable correctly matching the> orm object so that they can> be properly updated.> Is this correct?No. Sadly, adding that ORDER BY is just voodoo programming, becauseit applies to the result of the SELECT while promising nothing aboutthe order in which INSERT/RETURNING will act on those rows.Re-reading that 2012 thread, the main new observation I'd make todayis that parallel operation is a thing now, and it's not hard to foreseethat sometime soon we'll want to parallelize INSERTs. Which'd make it*really* hard to promise anything about the order of RETURNING output.
if I can state this without having RETURNING getting in the way, because we know RETURNING is not ordered.
Start with this table:
CREATE TABLE mytable (
id SERIAL PRIMARY KEY,
a INT,
b INT
)
Then insert two rows, where id SERIAL fires implicitly, assume the next value the sequence will give us is N1, and then the value after that is N2. It doesn't matter what N1 and N2 are (don't need to be consecutive) but we want N2 > N1, that is, increasing.
INSERT INTO mytable (a, b) SELECT p1, p2 FROM (VALUES (10, 11, 1), (12, 13, 2)) AS sel(p1, p2, num) ORDER BY num
Then SELECT with ORDER BY:
SELECT id, a, b FROM mytable ORDER BY id
We want the results to be:
(N1, 10, 11)
(N2, 12, 13)
and we dont want them to *ever* be:
(N1, 12, 13)
(N2, 10, 11)
that is, we want the SERIAL column (or an IDENTITY also) to be lined up with the VALUES.
From what you wrote in https://www.postgresql.org/message-id/29386.1528813619%40sss.pgh.pa.us , that seems to be exactly what you've stated, where this statement:
INSERT INTO table (a, b) SELECT p1, p2 FROM (VALUES (10, 11, 1), (12, 13, 2)) AS sel(p1, p2, num) ORDER BY num
is organized by the query planner to essentially be equivalent to this, where the nextval() is part of the SELECTed data:
INSERT INTO mytable (id, a, b) SELECT nextval('mytable_id_seq'), p1, p2 FROM
(SELECT p1, p2 FROM (VALUES (10, 11, 1), (12, 13, 2)) AS sel(p1, p2, num) ORDER BY num) as _x
in practice, we add "RETURNING id" and expect those "id" cols to be in increasing order, so we sort to match it up with the input rows.
I think if you want to use RETURNING with multi-row inserts, thething to do is more likeINSERT INTO t(data) VALUES ('a'), ('b'), ('c') RETURNING data, idand then explicitly match up the returned "data" values rather thanpresuming they appear in the same order you wrote them in in VALUES.
we're going to do that also when the table has something like a uuid for a primary key or otherwise.
Admittedly this might be problematic if some of the VALUES rowsare identical, but how much should you care?
we only do any of this if the rows have something unique in them we can hook onto.
regards, tom lane
thanks so much for replying!
On Tue, 11 Apr 2023 at 22:59, Thorsten Glaser <tg@evolvis.org> wrote: > > On Tue, 11 Apr 2023, Federico wrote: > > >The problem here is not having the auto increment id in a particular > > The id might not even be auto-increment but UUID or something… > (I am surprised you would even try to insert multiple rows at once.) Well the documentation makes no mention of any limitation on returning and the observed behaviour has consistently been that returning is in values order. Again, that was SQLAlchemy's fault for assuming this (but the docs surely did not help). Also re-reading my reply, I've made a typo there, sorry. What it should have read is: The problem here is not having the returned ids in a particular order, is that there is apparently no correlation with the position of an element in the values clause with the id generated. Of course sorting the returned ids is only viable when using a serial or identity column, that's why in the general case I've mentioned the insert with sentinel column to ask if there are better or alternative solutions. Thanks for the reply, best Federico > > bye, > //mirabilos > -- > 15:41⎜<Lo-lan-do:#fusionforge> Somebody write a testsuite for helloworld :-) > > > >
On Tue, 11 Apr 2023, Federico wrote: >Of course sorting the returned ids is only viable when using a serial Yes, which is why I pointed out it doesn’t have to be. >or identity column, that's why in the general case I've mentioned the >insert with sentinel column But it was pointed out that that’s not guaranteed either, unless you add that sentinel column to the table itself… bye, //mirabilos -- 15:41⎜<Lo-lan-do:#fusionforge> Somebody write a testsuite for helloworld :-)
On 4/11/23 14:37, Federico wrote: > > The problem here is not having the auto increment id in a particular > order, is that there > is apparently no correlation with the position of an element in the > values clause with the > id generated. That's the reason for using the sentinel column in the > general solution in the previous message. > > The extend on the use case, SQLAlchemy has 3 objects T that have > T(data='a'), T(data='b'), T(data='c') but no > value for the id column. The objective is to insert the 3 data values, > get back the ids and correctly match them with > the correct 3 objects. > >> No. Sadly, adding that ORDER BY is just voodoo programming, because >> it applies to the result of the SELECT while promising nothing about >> the order in which INSERT/RETURNING will act on those rows. > I wasn't probably clear, it's fine if INSERT/RETURNING order is > arbitrary, what matters is that the > autoincementing values is executed in the same order as select, like > mentioned in this > previous message > https://www.postgresql.org/message-id/29386.1528813619%40sss.pgh.pa.us > > Is that not the case? > >> Re-reading that 2012 thread, the main new observation I'd make today >> is that parallel operation is a thing now, and it's not hard to foresee >> that sometime soon we'll want to parallelize INSERTs. Which'd make it >> *really* hard to promise anything about the order of RETURNING output. > I think it's fine not promising anything about the order of RETURNING, but > it would be very helpful having a way of tracking what input row > generated a particular > output row. Basically the sentinel case in the original post, > without actually having to insert the sentinel into the table. > >> I think if you want to use RETURNING with multi-row inserts, the >> thing to do is more like >> >> INSERT INTO t(data) VALUES ('a'), ('b'), ('c') RETURNING data, id >> >> and then explicitly match up the returned "data" values rather than >> presuming they appear in the same order you wrote them in in VALUES. >> Admittedly this might be problematic if some of the VALUES rows >> are identical, but how much should you care? > Well, the example is very easy, but it's hard to generalize when > inserting multiple columns > with possible complex values in them, since it would mean matching on > possibly large json values, > arrays, etc. So definitely not ideal > > Thanks, > Federico > Can your client retain a hashmap of md5,data pairings, allowing the lookup on the way back using the returned data and supplied id?
On Tue, 11 Apr 2023 at 23:22, Thorsten Glaser <tg@evolvis.org> wrote: > > On Tue, 11 Apr 2023, Federico wrote: > > >Of course sorting the returned ids is only viable when using a serial > > Yes, which is why I pointed out it doesn’t have to be. > > >or identity column, that's why in the general case I've mentioned the > >insert with sentinel column > > But it was pointed out that that’s not guaranteed either, unless you add > that sentinel column to the table itself… I was under the impression that when using INSERT SELECT ORDER BY the sequence ids were generated using the select order. That has been mentioned in multiple other previous messages, like https://www.postgresql.org/message-id/29386.1528813619%40sss.pgh.pa.us The above does not cover all cases, but in practice serial or identity are very common, so it would be nice if at least in these cases a sentinel is not needed Thanks, Federico > bye, > //mirabilos > -- > 15:41⎜<Lo-lan-do:#fusionforge> Somebody write a testsuite for helloworld :-) > > > >
On Tue, 11 Apr 2023 at 23:31, Rob Sargent <robjsargent@gmail.com> wrote: > > On 4/11/23 14:37, Federico wrote: > > > > The problem here is not having the auto increment id in a particular > > order, is that there > > is apparently no correlation with the position of an element in the > > values clause with the > > id generated. That's the reason for using the sentinel column in the > > general solution in the previous message. > > > > The extend on the use case, SQLAlchemy has 3 objects T that have > > T(data='a'), T(data='b'), T(data='c') but no > > value for the id column. The objective is to insert the 3 data values, > > get back the ids and correctly match them with > > the correct 3 objects. > > > >> No. Sadly, adding that ORDER BY is just voodoo programming, because > >> it applies to the result of the SELECT while promising nothing about > >> the order in which INSERT/RETURNING will act on those rows. > > I wasn't probably clear, it's fine if INSERT/RETURNING order is > > arbitrary, what matters is that the > > autoincementing values is executed in the same order as select, like > > mentioned in this > > previous message > > https://www.postgresql.org/message-id/29386.1528813619%40sss.pgh.pa.us > > > > Is that not the case? > > > >> Re-reading that 2012 thread, the main new observation I'd make today > >> is that parallel operation is a thing now, and it's not hard to foresee > >> that sometime soon we'll want to parallelize INSERTs. Which'd make it > >> *really* hard to promise anything about the order of RETURNING output. > > I think it's fine not promising anything about the order of RETURNING, but > > it would be very helpful having a way of tracking what input row > > generated a particular > > output row. Basically the sentinel case in the original post, > > without actually having to insert the sentinel into the table. > > > >> I think if you want to use RETURNING with multi-row inserts, the > >> thing to do is more like > >> > >> INSERT INTO t(data) VALUES ('a'), ('b'), ('c') RETURNING data, id > >> > >> and then explicitly match up the returned "data" values rather than > >> presuming they appear in the same order you wrote them in in VALUES. > >> Admittedly this might be problematic if some of the VALUES rows > >> are identical, but how much should you care? > > Well, the example is very easy, but it's hard to generalize when > > inserting multiple columns > > with possible complex values in them, since it would mean matching on > > possibly large json values, > > arrays, etc. So definitely not ideal > > > > Thanks, > > Federico > > > Can your client retain a hashmap of md5,data pairings, allowing the > lookup on the way back using the returned data and supplied id? > When using unique columns or similar, that's something that is done, but if there are no unique columns in the value no match can be done reliably with the source data, since sqlalchemy is a library that allows arbitrary schemas to be generated. Thanks for the reply, Federico
On Tue, 11 Apr 2023, Federico wrote: >I was under the impression that when using INSERT SELECT ORDER BY the sequence >ids were generated using the select order. But someone said that’s not guaranteed, especially when INSERT will be parallelised later. bye, //mirabilos -- 15:41⎜<Lo-lan-do:#fusionforge> Somebody write a testsuite for helloworld :-)
>> Can your client retain a hashmap of md5,data pairings, allowing the >> lookup on the way back using the returned data and supplied id? >> > When using unique columns or similar, that's something that is done, > but if there are no unique columns in the value no match can be done > reliably with the source data, since sqlalchemy is a library that > allows arbitrary schemas to be generated. > > Thanks for the reply, > Federico So you're returned data is not what was sent to the server? Otherwise it should generate the same md5, as I understand it. Identical data would of course be a problem.
On Tue, 11 Apr 2023 at 23:44, Thorsten Glaser <tg@evolvis.org> wrote: > > On Tue, 11 Apr 2023, Federico wrote: > > >I was under the impression that when using INSERT SELECT ORDER BY the sequence > >ids were generated using the select order. > > But someone said that’s not guaranteed, especially when INSERT will > be parallelised later. It was Tom Lane's message that said > Re-reading that 2012 thread, the main new observation I'd make today > is that parallel operation is a thing now, and it's not hard to foresee > that sometime soon we'll want to parallelize INSERTs. Which'd make it > *really* hard to promise anything about the order of RETURNING output. My reading of it is that we are talking about RETURNING, not about the order in which the serial ids are generated. My understanding was that they are generated in select order, then the rows are inserted in any arbitrary order the planner may choose and returned again in any arbitrary order. If my understanding is incorrect, would this alternative guarantee the above (that nextval is called in the order set by ORDER BY), again re-using the table in the original message? INSERT INTO t(id, data) SELECT nextval(pg_get_serial_sequence('t', 'id')) data FROM (VALUES ('a', 1), ('b', 2), ('c', 3)) as vv(data, num) ORDER BY num RETURNING id best, Federico > bye, > //mirabilos > -- > 15:41⎜<Lo-lan-do:#fusionforge> Somebody write a testsuite for helloworld :-) > > > >
On Tue, 11 Apr 2023 at 23:46, Rob Sargent <robjsargent@gmail.com> wrote: > > > >> Can your client retain a hashmap of md5,data pairings, allowing the > >> lookup on the way back using the returned data and supplied id? > >> > > When using unique columns or similar, that's something that is done, > > but if there are no unique columns in the value no match can be done > > reliably with the source data, since sqlalchemy is a library that > > allows arbitrary schemas to be generated. > > > > Thanks for the reply, > > Federico > So you're returned data is not what was sent to the server? Otherwise it > should generate the same md5, as I understand it. Identical data would > of course be a problem. > That should be the case, yes. If a table has a non-nullable unique key, it should be possible to use a hashmap and perform that lockup. We are planning on implementing something like this to cover the cases where it can be used. Thanks for the reply, Federico
On Tue, 11 Apr 2023, Federico wrote: >My understanding was that they are generated in select order But are they? (I don’t know, but I’d not assume that.) >If my understanding is incorrect, would this alternative guarantee the above > INSERT INTO t(id, data) > SELECT nextval(pg_get_serial_sequence('t', 'id')) data > FROM (VALUES ('a', 1), ('b', 2), ('c', 3)) as vv(data, num) > ORDER BY num > RETURNING id Wouldn’t, at that point, it be better to just send multiple individual INSERT statements? The overhead (on both sides) for all mentioned… workarounds… surely is larger than that? bye, //mirabilos -- 15:41⎜<Lo-lan-do:#fusionforge> Somebody write a testsuite for helloworld :-)
I'm not getting every reply in the list but I want to point this out from the archive version of this thread:
> > I was under the impression that when using INSERT SELECT ORDER BY the sequence
> > ids were generated using the select order.
> But someone said that’s not guaranteed, especially when INSERT will
> be parallelised later.
this should not matter. as in my other message if INSERT INTO table SELECT a, b FROM (VALUES () () ()) ORDER BY... takes the SERIAL or IDENTITY column of the table, and creates a parse tree for that looks like INSERT INTO table (SELECT nextval(table.id), a, b FROM (SELECT a, b FROM (VALUES () () ()) ORDER BY), the INSERT can put the rows in any way it wants. We don't care what INSERT does, we care about the generated sequence value, the nextval(table.id) part, which the SELECT should be emitting in order and occurs outside of the purview of the INSERT, according to other emails I have read on these lists (see my previous post).
On Tue, Apr 11, 2023, at 5:07 PM, Federico wrote:
On Tue, 11 Apr 2023 at 22:59, Thorsten Glaser <tg@evolvis.org> wrote:>> On Tue, 11 Apr 2023, Federico wrote:>> >The problem here is not having the auto increment id in a particular>> The id might not even be auto-increment but UUID or something…> (I am surprised you would even try to insert multiple rows at once.)Well the documentation makes no mention of any limitation on returningand the observed behaviour has consistently been that returning is invalues order.Again, that was SQLAlchemy's fault for assuming this (but the docssurely did not help).Also re-reading my reply, I've made a typo there, sorry. What itshould have read is:The problem here is not having the returned ids in a particularorder, is that there is apparently no correlation with the position ofan element in the values clause with the id generated.Of course sorting the returned ids is only viable when using a serialor identity column, that's why in the general case I've mentioned theinsert with sentinel column to ask if there are better or alternativesolutions.Thanks for the reply, bestFederico>> bye,> //mirabilos> --> 15:41⎜<Lo-lan-do:#fusionforge> Somebody write a testsuite for helloworld :-)>>>>
On Wed, 12 Apr 2023 at 11:46, Thorsten Glaser <tg@evolvis.org> wrote: > > On Tue, 11 Apr 2023, Federico wrote: > > >My understanding was that they are generated in select order > > But are they? (I don’t know, but I’d not assume that.) That's kind of the point for this question, to see if that's correct or not. > >If my understanding is incorrect, would this alternative guarantee the above > > > INSERT INTO t(id, data) > > SELECT nextval(pg_get_serial_sequence('t', 'id')) data > > FROM (VALUES ('a', 1), ('b', 2), ('c', 3)) as vv(data, num) > > ORDER BY num > > RETURNING id > > Wouldn’t, at that point, it be better to just send multiple > individual INSERT statements? The overhead (on both sides) > for all mentioned… workarounds… surely is larger than that? No, not by a long shot. Sending thousands of single inserts sequentially over the network requires a lot more time even when doing that on localhost. Using a single statement is many times faster. Federico > bye, > //mirabilos > -- > 15:41⎜<Lo-lan-do:#fusionforge> Somebody write a testsuite for helloworld :-) > > > >
Hi All,
The fundamental question here is:
What does "insert into object(col1, ... , coln) select sel1, ... , seln from ... order by ord1, ... ordm returning val1, ..., valk" mean?
In particular, what does the "order by" do in "insert ... select ... order by"? Will this be honoured with parallel inserts? Is that really too much to ask?
Is a DBMS just a black box archive of data (throw it in and forget about it)? In my opinion, it's a system for the consistent and efficient storage and retrieval of data, to be used in the context of other software systems. Programmatically, that means matching up what's in the database with what's stored in memory.
And yes, a DBMS needs to give guarantees! ACID compliance is one such guarantee, but it is not the only one required.
Finally, please don't read this badly. It is meant purely to question where we want to go, I strongly support PostgreSQL and believe it has made many major contributions to the development of DBMS over the years.
Many thanks,
John.
On Wed, 12 Apr 2023 at 11:10, Federico <cfederico87@gmail.com> wrote:
On Wed, 12 Apr 2023 at 11:46, Thorsten Glaser <tg@evolvis.org> wrote:
>
> On Tue, 11 Apr 2023, Federico wrote:
>
> >My understanding was that they are generated in select order
>
> But are they? (I don’t know, but I’d not assume that.)
That's kind of the point for this question, to see if that's correct or not.
> >If my understanding is incorrect, would this alternative guarantee the above
>
> > INSERT INTO t(id, data)
> > SELECT nextval(pg_get_serial_sequence('t', 'id')) data
> > FROM (VALUES ('a', 1), ('b', 2), ('c', 3)) as vv(data, num)
> > ORDER BY num
> > RETURNING id
>
> Wouldn’t, at that point, it be better to just send multiple
> individual INSERT statements? The overhead (on both sides)
> for all mentioned… workarounds… surely is larger than that?
No, not by a long shot. Sending thousands of single inserts
sequentially over the network requires a lot more time even when doing
that on localhost.
Using a single statement is many times faster.
Federico
> bye,
> //mirabilos
> --
> 15:41⎜<Lo-lan-do:#fusionforge> Somebody write a testsuite for helloworld :-)
>
>
>
>
On Tue, Apr 11, 2023 at 4:38 PM Federico <cfederico87@gmail.com> wrote:
Thanks for the ansers
> 2) What would you order by, id or data or both?
by values order, (that incidentally seems to be what PG does)
> with i as (INSERT INTO t(data) VALUES ('a', 1), ('b', 2), ('c', 3)
> returning id)
> select i.id from i order by id;
The problem here is not having the auto increment id in a particular
order, is that there
is apparently no correlation with the position of an element in the
values clause with the
id generated. That's the reason for using the sentinel column in the
general solution in the previous message.
The extend on the use case, SQLAlchemy has 3 objects T that have
T(data='a'), T(data='b'), T(data='c') but no
value for the id column. The objective is to insert the 3 data values,
get back the ids and correctly match them with
the correct 3 objects.
> No. Sadly, adding that ORDER BY is just voodoo programming, because
> it applies to the result of the SELECT while promising nothing about
> the order in which INSERT/RETURNING will act on those rows.
I wasn't probably clear, it's fine if INSERT/RETURNING order is
arbitrary, what matters is that the
autoincementing values is executed in the same order as select, like
mentioned in this
previous message
https://www.postgresql.org/message-id/29386.1528813619%40sss.pgh.pa.us
Is that not the case?
> Re-reading that 2012 thread, the main new observation I'd make today
> is that parallel operation is a thing now, and it's not hard to foresee
> that sometime soon we'll want to parallelize INSERTs. Which'd make it
> *really* hard to promise anything about the order of RETURNING output.
I think it's fine not promising anything about the order of RETURNING, but
it would be very helpful having a way of tracking what input row
generated a particular
output row. Basically the sentinel case in the original post,
without actually having to insert the sentinel into the table.
> I think if you want to use RETURNING with multi-row inserts, the
> thing to do is more like
>
> INSERT INTO t(data) VALUES ('a'), ('b'), ('c') RETURNING data, id
>
> and then explicitly match up the returned "data" values rather than
> presuming they appear in the same order you wrote them in in VALUES.
> Admittedly this might be problematic if some of the VALUES rows
> are identical, but how much should you care?
Well, the example is very easy, but it's hard to generalize when
inserting multiple columns
with possible complex values in them, since it would mean matching on
possibly large json values,
arrays, etc. So definitely not ideal
Thanks,
Federico
On Tue, 11 Apr 2023 at 22:06, Adrian Klaver <adrian.klaver@aklaver.com> wrote:
>
> On 4/11/23 12:47, Federico wrote:
> > Hello list,
> >
> https://www.sqlite.org/lang_returning.html#limitations_and_caveats
> >
> > Searching the archive seems that a using the INSERT SELECT ORDER BY
> > form should be a better solution,
> > so the above insert should be rewritten as
> >
> > INSERT INTO t(data)
> > SELECT data FROM (VALUES ('a', 1), ('b', 2), ('c', 3)) as vv(data,
> > num) ORDER BY num
> > RETURNING id
>
> Or
>
> with i as (INSERT INTO t(data) VALUES ('a', 1), ('b', 2), ('c', 3)
> returning id)
> select i.id from i order by id;
>
> > Sorry for the long email,
> > Thanks
> >
> > Federico
> >
> >
>
> --
> Adrian Klaver
> adrian.klaver@aklaver.com
>
A couple of comments. For the more generic, I prefer RETURNING *
you get back all the columns for matching. To me, this solves the problem in a very generic way.
But SQL (and SET THEORY) basically imply you cannot trust the sequencing of a set of transactions. Parallel execution is just a great simple example.
Secondarily, many frameworks I've worked with (and custom ones developed) would actually call the SEQUENCE.NEXTVAL, and assign the IDs, in memory, accepting that we would have gaping holes if some transactions were never actually sent to the server. We did this a lot in master-detail GUI type stuff. It's just easier. The children knew their parent ID, and all the children ID's were effectively known before committing. It made for simple code that never failed.
(for large datasets we would want one query that returned a set of IDs, we could order that. And apply it to the records we were about to insert). [Be Careful with GENERATED ALWAYS pks to OVERRIDE]
you get back all the columns for matching. To me, this solves the problem in a very generic way.
But SQL (and SET THEORY) basically imply you cannot trust the sequencing of a set of transactions. Parallel execution is just a great simple example.
Secondarily, many frameworks I've worked with (and custom ones developed) would actually call the SEQUENCE.NEXTVAL, and assign the IDs, in memory, accepting that we would have gaping holes if some transactions were never actually sent to the server. We did this a lot in master-detail GUI type stuff. It's just easier. The children knew their parent ID, and all the children ID's were effectively known before committing. It made for simple code that never failed.
(for large datasets we would want one query that returned a set of IDs, we could order that. And apply it to the records we were about to insert). [Be Careful with GENERATED ALWAYS pks to OVERRIDE]
HTH
On 4/12/23 2:35 PM, Kirk Wolak wrote: > On Tue, Apr 11, 2023 at 4:38 PM Federico <cfederico87@gmail.com > > A couple of comments. For the more generic, I prefer RETURNING * > you get back all the columns for matching. To me, this solves the > problem in a very generic way. From what I gather from the conversation RETURNING is the red herring. The request is that for: INSERT INTO some_table(char_fld) VALUES('a'), ('b'), ('c') where some_table has an auto increment field that the values created for said field will always be done in the order that VALUES data was presented so: SELECT id, char_fld from some_table will always return: (1, 'a') (2, 'b') (3, 'c') > But SQL (and SET THEORY) basically imply you cannot trust the sequencing > of a set of transactions. Parallel execution is just a great simple > example. > > Secondarily, many frameworks I've worked with (and custom ones > developed) would actually call the SEQUENCE.NEXTVAL, and assign the IDs, > in memory, accepting that we would have gaping holes if some > transactions were never actually sent to the server. We did this a lot > in master-detail GUI type stuff. It's just easier. The children knew > their parent ID, and all the children ID's were effectively known before > committing. It made for simple code that never failed. > (for large datasets we would want one query that returned a set of IDs, > we could order that. And apply it to the records we were about to > insert). [Be Careful with GENERATED ALWAYS pks to OVERRIDE] > > HTH -- Adrian Klaver adrian.klaver@aklaver.com
On Wed, 12 Apr 2023 at 23:49, Adrian Klaver <adrian.klaver@aklaver.com> wrote: > > On 4/12/23 2:35 PM, Kirk Wolak wrote: > > On Tue, Apr 11, 2023 at 4:38 PM Federico <cfederico87@gmail.com > > > > > A couple of comments. For the more generic, I prefer RETURNING * > > you get back all the columns for matching. To me, this solves the > > problem in a very generic way. > > From what I gather from the conversation RETURNING is the red herring. > > The request is that for: > > INSERT INTO some_table(char_fld) VALUES('a'), ('b'), ('c') > > where some_table has an auto increment field that the values created for > said field will always be done in the order that VALUES data was > presented so: > > SELECT id, char_fld from some_table will always return: > > (1, 'a') > (2, 'b') > (3, 'c') It's actually for an insert query like this INSERT INTO t(data) SELECT data FROM (VALUES ('a', 1), ('b', 2), ('c', 3)) as vv(data, num) ORDER BY num Or even (but the above would be nicer) INSERT INTO t(id, data) SELECT nextval(pg_get_serial_sequence('t', 'id')) data FROM (VALUES ('a', 1), ('b', 2), ('c', 3)) as vv(data, num) ORDER BY num Federico > > > But SQL (and SET THEORY) basically imply you cannot trust the sequencing > > of a set of transactions. Parallel execution is just a great simple > > example. > > > > Secondarily, many frameworks I've worked with (and custom ones > > developed) would actually call the SEQUENCE.NEXTVAL, and assign the IDs, > > in memory, accepting that we would have gaping holes if some > > transactions were never actually sent to the server. We did this a lot > > in master-detail GUI type stuff. It's just easier. The children knew > > their parent ID, and all the children ID's were effectively known before > > committing. It made for simple code that never failed. > > (for large datasets we would want one query that returned a set of IDs, > > we could order that. And apply it to the records we were about to > > insert). [Be Careful with GENERATED ALWAYS pks to OVERRIDE] > > > > HTH > > > -- > Adrian Klaver > adrian.klaver@aklaver.com
On Wed, Apr 12, 2023 at 5:49 PM Adrian Klaver <adrian.klaver@aklaver.com> wrote:
On 4/12/23 2:35 PM, Kirk Wolak wrote:
> On Tue, Apr 11, 2023 at 4:38 PM Federico <cfederico87@gmail.com
>
> A couple of comments. For the more generic, I prefer RETURNING *
> you get back all the columns for matching. To me, this solves the
> problem in a very generic way.
From what I gather from the conversation RETURNING is the red herring.
The request is that for:
INSERT INTO some_table(char_fld) VALUES('a'), ('b'), ('c')
where some_table has an auto increment field that the values created for
said field will always be done in the order that VALUES data was
presented so:
SELECT id, char_fld from some_table will always return:
(1, 'a')
(2, 'b')
(3, 'c')
The solution exists. Pre-fetch the IDs, assign them and insert them with the IDs. Then you have 100% control.
SELECT NEXTVAL('tbl_seq') from GENERATE_SERIES(1, <total_needed>);
// Update your structure, then insert, using these values. SINCE the intention is to update your structure anyways.
// This simply changes the order of operation and requires nothing to work in many environments
Or, with RETURNING *, assign them into your structure based on how the system assigned the IDs
Clearly this is harder than the first suggestion. But it works, without changing anything.
But I find the recommendation to make a DB adhere to ordering "non-ordered" sets, especially when, as stated,
it would not allow for parallelism. I would much rather have parallelism in my INSERTs than some arbitrary commitment
that the slew of data I throw at the DB be processed in an order for some "edge case" that really doesn't simplify the coding.
> But SQL (and SET THEORY) basically imply you cannot trust the sequencing
> of a setoftransactions. Parallel execution is just a great simple
> example.
>
> Secondarily, many frameworks I've worked with (and custom ones
> developed) would actually call the SEQUENCE.NEXTVAL, and assign the IDs,
> in memory, accepting that we would have gaping holes if some
> transactions were never actually sent to the server. We did this a lot
> in master-detail GUI type stuff. It's just easier. The children knew
> their parent ID, and all the children ID's were effectively known before
> committing. It made for simple code that never failed.
> (for large datasets we would want one query that returned a set of IDs,
> we could order that. And apply it to the records we were about to
> insert). [Be Careful with GENERATED ALWAYS pks to OVERRIDE]
>
> HTH
--
Adrian Klaver
adrian.klaver@aklaver.com
On Wed, 12 Apr 2023, Kirk Wolak wrote: >The solution exists. Pre-fetch the IDs, assign them and insert them with >the IDs. Then you have 100% control. >SELECT NEXTVAL('tbl_seq') from GENERATE_SERIES(1, <total_needed>); This would be a nice solution… but the PK could be always generated, or not even sequential (UUIDs for example). If you’re developing schema-first the application would not even (need to) know about how the IDs are generated as it’s the DB that generates them. AIUI the OP’s an SQLAlchemy developer, which is an ORM. Therefore he needs a generic solution. bye, //mirabilos -- 15:41⎜<Lo-lan-do:#fusionforge> Somebody write a testsuite for helloworld :-)
> This would be a nice solution… but the PK could be always generated, > or not even sequential (UUIDs for example). If you’re developing > schema-first the application would not even (need to) know about how > the IDs are generated as it’s the DB that generates them. > > AIUI the OP’s an SQLAlchemy developer, which is an ORM. Therefore he > needs a generic solution. Or a different dev env. :)
On Wed, Apr 12, 2023, at 6:24 PM, Thorsten Glaser wrote:
On Wed, 12 Apr 2023, Kirk Wolak wrote:>The solution exists. Pre-fetch the IDs, assign them and insert them with>the IDs. Then you have 100% control.>SELECT NEXTVAL('tbl_seq') from GENERATE_SERIES(1, <total_needed>);This would be a nice solution… but the PK could be always generated,or not even sequential (UUIDs for example). If you’re developingschema-first the application would not even (need to) know about howthe IDs are generated as it’s the DB that generates them.AIUI the OP’s an SQLAlchemy developer, which is an ORM. Therefore heneeds a generic solution.
im my view I already have the solution which is the form I referred to in my earlier email:
INSERT INTO mytable (a, b)
SELECT p1, p2 FROM (VALUES (10, 11, 1), (12, 13, 2)) AS sel(p1, p2, num) ORDER BY num
RETURNING mytable.id
the columns we want to INSERT, a and b, come from p1 and p2 in the SELECT. the third value in each table-row value inside the VALUES clause we call the "sentinel" and we apply it as an incrementing number which we also ORDER BY.
per Tom Lane's email at https://www.postgresql.org/message-id/29386.1528813619%40sss.pgh.pa.us in 2018, assuming mytable.id is serial (or I would assume IDENTITY), this resolves to something equivalent to:
INSERT INTO mytable (id, a, b)
SELECT nextval("mytable_id_seq"), p1, p2 FROM (
SELECT p1, p2 FROM (VALUES (10, 11, 1), (12, 13, 2)) AS sel(p1, p2, num) ORDER BY num
) AS ss RETURNING mytable.id
So the above is one part we want to confirm is accurate, hasn't changed, is not going to change much as far as its ultimate behavior. It means that as rows are generated from the SELECT, the ascending values for "mytable_id_seq" will correspond to the ascending values for "num". That's the important part; we are sending in an number that controls the sorting of records as they are delivered outwards, and the generated identifiers, assuming they are incrementing, will sort in that same way as each value is associated with the rows emanating from the inner SELECT. it does not matter if the sequence has gaps in it from other transactions or whatever, only that it increments in the same direction as "num" which we pass in.
In this sequence of events, the INSERT conceptually hasn't even happened for a particular row before the row has been associated with a sequence value. INSERT can do whatever it wants, it can break out into 20 threads and jumble up all the rows or whatever. When it sends them out in RETURNING, whatever order it comes back is no problem; we just want those "mytable_id_seq" values back. We sort by that when we get the rows back. This ordering matches the one we put in. So the way these threads often seem to discuss how INSERTS and RETURNING are both not ordered are missing this particular point, that we really just need a way to relate server generated values to the rows we put in.
This form comes from two places, one is from Microsoft SQL Server which documents as well as guaranteed to maintain order at https://learn.microsoft.com/en-us/sql/t-sql/statements/insert-transact-sql?view=sql-server-ver16; "INSERT queries that use SELECT with ORDER BY to populate rows guarantees how identity values are computed but not the order in which the rows are inserted." ; for PostgreSQL, the 2018 thread above seems to confirm a similar behavior for PostgreSQL.
So we were looking for feedback from PG developers on the above assumptions. Beyond that, throughout all the theoretical talk about "well PG might change this someday", while PG most certainly can and maybe will change some of these assumptions, it would produce a lot of surprises for many users who knowingly or unknowingly have relied on various orderings of this way and it would likely lead to some very good documentation about this area of functionality, maybe even some new syntaxes or keywords to help with this extremely common use case. Documentation for these things is what the various threads on this subject often ask for. The first such thread I can find asking about this, which also ended kind of ambiguously, was in 2012 here: https://www.mail-archive.com/pgsql-hackers@postgresql.org/msg204156.html . Then there's the 2018 thread, and also a bunch on stackoverflow. Everyone is left guessing or not totally sure since there's no official documentation for this.
We're going to document in SQLAlchemy the approaches we are using for different backends and I wouldnt be surprised if when people google for "PostgreSQL INSERT RETURNING ORDERING" our own docs are going to come up near the top. That seems to happen a lot.
On Wed, 12 Apr 2023, Mike Bayer wrote: >ascending values for "mytable_id_seq" will correspond to the ascending >values for "num". But, again, what if it’s uuid or something instead of a sequence? bye, //mirabilos -- 15:41⎜<Lo-lan-do:#fusionforge> Somebody write a testsuite for helloworld :-)
On 4/12/23 17:58, Thorsten Glaser wrote: > On Wed, 12 Apr 2023, Mike Bayer wrote: > >> ascending values for "mytable_id_seq" will correspond to the ascending >> values for "num". > > But, again, what if it’s uuid or something instead of a sequence? This for an ORM that will be using a sequence. > > bye, > //mirabilos -- Adrian Klaver adrian.klaver@aklaver.com
We do something different for uuids. These are usually created client side anyway or are very easy to create client sideif not and then we also have a separate sentinel column option. Overall the whole thing is an optimization that can be turned off for uncommon cases. We just can't go back to having insertof 1000 rows be 3-5x slower for all tables w server generated primary keys which is what statement at a time gets us. Surrogate integer Pks w sequence or identity is the big case. On Wed, Apr 12, 2023, at 8:58 PM, Thorsten Glaser wrote: > On Wed, 12 Apr 2023, Mike Bayer wrote: > >>ascending values for "mytable_id_seq" will correspond to the ascending >>values for "num". > > But, again, what if it’s uuid or something instead of a sequence? > > bye, > //mirabilos > -- > 15:41⎜<Lo-lan-do:#fusionforge> Somebody write a testsuite for helloworld :-)
The ideal solution would be that the RETURNING values are in the same order as the declared rows in the table literal "FROM (VALUES (10, 11, 1), (12, 13, 2)) AS sel(p1, p2, num)". But before any of you jump down my throat about about sets and a bar to parallel inserts, I think it's a small matter of some internal smoke and mirrors: simply marshal the RETURNING results through and in memory/temporary table maintaining the order from the SELECT and return these to the client upon commit. At this moment, I believe the folk at SQLAlchemy are looking for a work around that can work for most use cases with either an identity column or a fake "sentinel" column for matching back to the ORM objects. There may be an upper limit of 1000 to the number rows in a single insert (this arises in MS SQL Server, who absolutely do not guarantee the order of their OUTPUT clause -- it's tied to the physical row insert rather than marshalled).
My feeling is that PostgreSQL may already do this. See the commit:
and the original message thread
and references to undesired behaviour prior to PostgreSQL 9.6 such as in https://www.postgresql.org/docs/current/sql-select.html.
The test cases in the above commit use:
SELECT nextval('sequence'), datavals FROM ... ORDER BY something;
type syntax. And I presume that's exactly what's happening internally when there's a serial identity column (via query rewrite rules or another mechanism).
So really, that just needs confirming. What happens with triggers? Presumably the same as long as query rewrite rules are used, but this is presumably getting deeper into the code for actual inserts after the initial select. The jump to the output being ordered, is just a guess but there's some bits and pieces that seem to suggest that there may indeed be a marshalling process going on in the background (whether or not that is linked to the original order is another matter).
I have set up a PostgreSQL server to test if I can break this hypothesis and see what query explains can allude to. Does anyone have a test case where the order is not preserved?
Might I also point out that if the implementation of parallel inserts does create a bar then doing so may end up with the programmatic interfaces (such as SQLAlchemy) not being able to use that feature (possibly reverting back to single inserts). Ur, so what would be the problem being solved with parallel inserts?
On Thu, 13 Apr 2023 at 02:27, Mike Bayer <mike_mp@zzzcomputing.com> wrote:
We do something different for uuids. These are usually created client side anyway or are very easy to create client side if not and then we also have a separate sentinel column option.
Overall the whole thing is an optimization that can be turned off for uncommon cases. We just can't go back to having insert of 1000 rows be 3-5x slower for all tables w server generated primary keys which is what statement at a time gets us. Surrogate integer Pks w sequence or identity is the big case.
On Wed, Apr 12, 2023, at 8:58 PM, Thorsten Glaser wrote:
> On Wed, 12 Apr 2023, Mike Bayer wrote:
>
>>ascending values for "mytable_id_seq" will correspond to the ascending
>>values for "num".
>
> But, again, what if it’s uuid or something instead of a sequence?
>
> bye,
> //mirabilos
> --
> 15:41⎜<Lo-lan-do:#fusionforge> Somebody write a testsuite for helloworld :-)
To clarify, the work around (for SQLAlchemy) does not require the RETURNING values to be ordered; they will be reordered code side. Longer term matching the order up would be an added bonus!
On Thu, 13 Apr 2023 at 04:02, John Howroyd <jdhowroyd@googlemail.com> wrote:
The ideal solution would be that the RETURNING values are in the same order as the declared rows in the table literal "FROM (VALUES (10, 11, 1), (12, 13, 2)) AS sel(p1, p2, num)". But before any of you jump down my throat about about sets and a bar to parallel inserts, I think it's a small matter of some internal smoke and mirrors: simply marshal the RETURNING results through and in memory/temporary table maintaining the order from the SELECT and return these to the client upon commit. At this moment, I believe the folk at SQLAlchemy are looking for a work around that can work for most use cases with either an identity column or a fake "sentinel" column for matching back to the ORM objects. There may be an upper limit of 1000 to the number rows in a single insert (this arises in MS SQL Server, who absolutely do not guarantee the order of their OUTPUT clause -- it's tied to the physical row insert rather than marshalled).My feeling is that PostgreSQL may already do this. See the commit:and the original message threadand references to undesired behaviour prior to PostgreSQL 9.6 such as in https://www.postgresql.org/docs/current/sql-select.html.The test cases in the above commit use:SELECT nextval('sequence'), datavals FROM ... ORDER BY something;type syntax. And I presume that's exactly what's happening internally when there's a serial identity column (via query rewrite rules or another mechanism).So really, that just needs confirming. What happens with triggers? Presumably the same as long as query rewrite rules are used, but this is presumably getting deeper into the code for actual inserts after the initial select. The jump to the output being ordered, is just a guess but there's some bits and pieces that seem to suggest that there may indeed be a marshalling process going on in the background (whether or not that is linked to the original order is another matter).I have set up a PostgreSQL server to test if I can break this hypothesis and see what query explains can allude to. Does anyone have a test case where the order is not preserved?Might I also point out that if the implementation of parallel inserts does create a bar then doing so may end up with the programmatic interfaces (such as SQLAlchemy) not being able to use that feature (possibly reverting back to single inserts). Ur, so what would be the problem being solved with parallel inserts?On Thu, 13 Apr 2023 at 02:27, Mike Bayer <mike_mp@zzzcomputing.com> wrote:We do something different for uuids. These are usually created client side anyway or are very easy to create client side if not and then we also have a separate sentinel column option.
Overall the whole thing is an optimization that can be turned off for uncommon cases. We just can't go back to having insert of 1000 rows be 3-5x slower for all tables w server generated primary keys which is what statement at a time gets us. Surrogate integer Pks w sequence or identity is the big case.
On Wed, Apr 12, 2023, at 8:58 PM, Thorsten Glaser wrote:
> On Wed, 12 Apr 2023, Mike Bayer wrote:
>
>>ascending values for "mytable_id_seq" will correspond to the ascending
>>values for "num".
>
> But, again, what if it’s uuid or something instead of a sequence?
>
> bye,
> //mirabilos
> --
> 15:41⎜<Lo-lan-do:#fusionforge> Somebody write a testsuite for helloworld :-)
here's what would be cool, a new token called "tuple_order" or something
INSERT INTO table (a, b, c) VALUES ((1, 2, 3), (4, 5, 6), ...) RETURNING table.id, inserted.tuple_order
tuple_order would be incrementing values 1, 2, 3, 4, 5, ... which correlate the each row delivered by RETURNING to each entry in the VALUES clause, in the order they were stated in that VALUES clause, that is entry (1, 2, 3) would be tuple_order 1, entry (4, 5, 6) would be tuple order 2, etc.
That token would solve the entire problem in all cases. The 1800 line changeset I've been working on all week would go away (if every database had this) and the change would be reduced to a few dozen lines.
On Wed, Apr 12, 2023, at 11:05 PM, John Howroyd wrote:
To clarify, the work around (for SQLAlchemy) does not require the RETURNING values to be ordered; they will be reordered code side. Longer term matching the order up would be an added bonus!On Thu, 13 Apr 2023 at 04:02, John Howroyd <jdhowroyd@googlemail.com> wrote:The ideal solution would be that the RETURNING values are in the same order as the declared rows in the table literal "FROM (VALUES (10, 11, 1), (12, 13, 2)) AS sel(p1, p2, num)". But before any of you jump down my throat about about sets and a bar to parallel inserts, I think it's a small matter of some internal smoke and mirrors: simply marshal the RETURNING results through and in memory/temporary table maintaining the order from the SELECT and return these to the client upon commit. At this moment, I believe the folk at SQLAlchemy are looking for a work around that can work for most use cases with either an identity column or a fake "sentinel" column for matching back to the ORM objects. There may be an upper limit of 1000 to the number rows in a single insert (this arises in MS SQL Server, who absolutely do not guarantee the order of their OUTPUT clause -- it's tied to the physical row insert rather than marshalled).My feeling is that PostgreSQL may already do this. See the commit:and the original message threadand references to undesired behaviour prior to PostgreSQL 9.6 such as in https://www.postgresql.org/docs/current/sql-select.html.The test cases in the above commit use:SELECT nextval('sequence'), datavals FROM ... ORDER BY something;type syntax. And I presume that's exactly what's happening internally when there's a serial identity column (via query rewrite rules or another mechanism).So really, that just needs confirming. What happens with triggers? Presumably the same as long as query rewrite rules are used, but this is presumably getting deeper into the code for actual inserts after the initial select. The jump to the output being ordered, is just a guess but there's some bits and pieces that seem to suggest that there may indeed be a marshalling process going on in the background (whether or not that is linked to the original order is another matter).I have set up a PostgreSQL server to test if I can break this hypothesis and see what query explains can allude to. Does anyone have a test case where the order is not preserved?Might I also point out that if the implementation of parallel inserts does create a bar then doing so may end up with the programmatic interfaces (such as SQLAlchemy) not being able to use that feature (possibly reverting back to single inserts). Ur, so what would be the problem being solved with parallel inserts?On Thu, 13 Apr 2023 at 02:27, Mike Bayer <mike_mp@zzzcomputing.com> wrote:We do something different for uuids. These are usually created client side anyway or are very easy to create client side if not and then we also have a separate sentinel column option.Overall the whole thing is an optimization that can be turned off for uncommon cases. We just can't go back to having insert of 1000 rows be 3-5x slower for all tables w server generated primary keys which is what statement at a time gets us. Surrogate integer Pks w sequence or identity is the big case.On Wed, Apr 12, 2023, at 8:58 PM, Thorsten Glaser wrote:> On Wed, 12 Apr 2023, Mike Bayer wrote:>>>ascending values for "mytable_id_seq" will correspond to the ascending>>values for "num".>> But, again, what if it’s uuid or something instead of a sequence?>> bye,> //mirabilos> --> 15:41⎜<Lo-lan-do:#fusionforge> Somebody write a testsuite for helloworld :-)
A few more thoughts.
My " simply marshal the RETURNING results through and in memory/temporary table maintaining the order from the SELECT and return these to the client upon commit" isn't quite right with the "upon commit": that (the commit) may happen much later with several statements wrapped into a transaction. So very much an internal decision about when those results happen to be returned. The sentiment is that the returning results are emitted as a last step in the processing of the statement / execution plan. Future statements in a multi-statement transaction may depend on the output.
However, the marshalling is presumably already happening (except any tie to ordering to the original declared tuples); otherwise I really don't understand how the with syntax (I think first suggested by Adrian Klaver) would work. @SQLAlchemy, if you wanted to off load reordering to database side a combination solution might be considered:
with rslt as (
INSERT INTO mytable (a, b)
SELECT p1, p2 FROM (VALUES (10, 11, 1), (12, 13, 2)) AS sel(p1, p2, num) ORDER BY num
RETURNING mytable.id, mytable.other_auto_gen
SELECT p1, p2 FROM (VALUES (10, 11, 1), (12, 13, 2)) AS sel(p1, p2, num) ORDER BY num
RETURNING mytable.id, mytable.other_auto_gen
)
select * from rslt
order by id
;
should work (not yet tested); and could be modified to the "sentinel" columns. Where the primary key is explicitly in the inserted columns (and you are not using an identity) then explicit matching would need to occur unless you are given an order guarantee. As for randomly generated GUID/UUID, I don't think anything will work other than a guarantee on order or a surrogate "sentinel" column/explicit tuple id.
@Mike Bayer: Isn't the order guarantee sufficient (wasn't that the original assumption in the 2.0 code)? I do quite understand the wish for having one solution that fits all without dialectic dependencies. However, in my opinion, this one is going to run for some time and is heavily dependent on internals. With mssql, this seems to me to be deep down in the insert internals and unlikely to change any time soon (at least until they start losing market share because other DBMSs do it better). Here (PostgreSQL) the actual mechanisms required can probably be achieved much more readily (given the will). But the fundamental issue of matching code side objects with database records (with side effects) remains and is only going to become more prominent. @PostgreSQL, isn't this your opportunity to lead the way again!-).
On Thu, 13 Apr 2023 at 05:26, Mike Bayer <mike_mp@zzzcomputing.com> wrote:
here's what would be cool, a new token called "tuple_order" or somethingINSERT INTO table (a, b, c) VALUES ((1, 2, 3), (4, 5, 6), ...) RETURNING table.id, inserted.tuple_ordertuple_order would be incrementing values 1, 2, 3, 4, 5, ... which correlate the each row delivered by RETURNING to each entry in the VALUES clause, in the order they were stated in that VALUES clause, that is entry (1, 2, 3) would be tuple_order 1, entry (4, 5, 6) would be tuple order 2, etc.That token would solve the entire problem in all cases. The 1800 line changeset I've been working on all week would go away (if every database had this) and the change would be reduced to a few dozen lines.On Wed, Apr 12, 2023, at 11:05 PM, John Howroyd wrote:To clarify, the work around (for SQLAlchemy) does not require the RETURNING values to be ordered; they will be reordered code side. Longer term matching the order up would be an added bonus!On Thu, 13 Apr 2023 at 04:02, John Howroyd <jdhowroyd@googlemail.com> wrote:The ideal solution would be that the RETURNING values are in the same order as the declared rows in the table literal "FROM (VALUES (10, 11, 1), (12, 13, 2)) AS sel(p1, p2, num)". But before any of you jump down my throat about about sets and a bar to parallel inserts, I think it's a small matter of some internal smoke and mirrors: simply marshal the RETURNING results through and in memory/temporary table maintaining the order from the SELECT and return these to the client upon commit. At this moment, I believe the folk at SQLAlchemy are looking for a work around that can work for most use cases with either an identity column or a fake "sentinel" column for matching back to the ORM objects. There may be an upper limit of 1000 to the number rows in a single insert (this arises in MS SQL Server, who absolutely do not guarantee the order of their OUTPUT clause -- it's tied to the physical row insert rather than marshalled).My feeling is that PostgreSQL may already do this. See the commit:and the original message threadand references to undesired behaviour prior to PostgreSQL 9.6 such as in https://www.postgresql.org/docs/current/sql-select.html.The test cases in the above commit use:SELECT nextval('sequence'), datavals FROM ... ORDER BY something;type syntax. And I presume that's exactly what's happening internally when there's a serial identity column (via query rewrite rules or another mechanism).So really, that just needs confirming. What happens with triggers? Presumably the same as long as query rewrite rules are used, but this is presumably getting deeper into the code for actual inserts after the initial select. The jump to the output being ordered, is just a guess but there's some bits and pieces that seem to suggest that there may indeed be a marshalling process going on in the background (whether or not that is linked to the original order is another matter).I have set up a PostgreSQL server to test if I can break this hypothesis and see what query explains can allude to. Does anyone have a test case where the order is not preserved?Might I also point out that if the implementation of parallel inserts does create a bar then doing so may end up with the programmatic interfaces (such as SQLAlchemy) not being able to use that feature (possibly reverting back to single inserts). Ur, so what would be the problem being solved with parallel inserts?On Thu, 13 Apr 2023 at 02:27, Mike Bayer <mike_mp@zzzcomputing.com> wrote:We do something different for uuids. These are usually created client side anyway or are very easy to create client side if not and then we also have a separate sentinel column option.Overall the whole thing is an optimization that can be turned off for uncommon cases. We just can't go back to having insert of 1000 rows be 3-5x slower for all tables w server generated primary keys which is what statement at a time gets us. Surrogate integer Pks w sequence or identity is the big case.On Wed, Apr 12, 2023, at 8:58 PM, Thorsten Glaser wrote:> On Wed, 12 Apr 2023, Mike Bayer wrote:>>>ascending values for "mytable_id_seq" will correspond to the ascending>>values for "num".>> But, again, what if it’s uuid or something instead of a sequence?>> bye,> //mirabilos> --> 15:41⎜<Lo-lan-do:#fusionforge> Somebody write a testsuite for helloworld :-)
On Thu, Apr 13, 2023, at 8:26 AM, John Howroyd wrote: > > However, the marshalling is presumably already happening (except any tie to ordering to the original declared tuples);otherwise I really don't understand how the with syntax (I think first suggested by Adrian Klaver) would work. @SQLAlchemy,if you wanted to off load reordering to database side a combination solution might be considered: > > with rslt as ( > INSERT INTO mytable (a, b) > SELECT p1, p2 FROM (VALUES (10, 11, 1), (12, 13, 2)) AS sel(p1, p2, num) ORDER BY num > RETURNING mytable.id, mytable.other_auto_gen > ) > select * from rslt > order by id > ; > > should work (not yet tested); and could be modified to the "sentinel" columns. the solution I propose is essentially doing the same SQL as above, except we deliver the RETURNING in any order and do thefinal sort of "id" client side; we are only sending out batches of a few thousand rows at a time. The "order" is notactually what's important, it's that we can relate server generated values to each tuple inside VALUES. > @Mike Bayer: Isn't the order guarantee sufficient (wasn't that the original assumption in the 2.0 code)? I do quite understandthe wish for having one solution that fits all without dialectic dependencies. However, in my opinion, this oneis going to run for some time and is heavily dependent on internals. With mssql, this seems to me to be deep down inthe insert internals and unlikely to change any time soon (at least until they start losing market share because otherDBMSs do it better). Here (PostgreSQL) the actual mechanisms required can probably be achieved much more readily (giventhe will). But the fundamental issue of matching code side objects with database records (with side effects) remainsand is only going to become more prominent. @PostgreSQL, isn't this your opportunity to lead the way again!-). oh like I did for database URLs, right ? :) which we then got in trouble for because we weren't using *pgs* format, whichwas inspired by ours in the first place... (switching mail client to plain text, haven't used old school mailing lists in a long time...)
On 2023-04-11 15:45:59 -0600, Rob Sargent wrote: > > > Can your client retain a hashmap of md5,data pairings, allowing the > > > lookup on the way back using the returned data and supplied id? > > > > > When using unique columns or similar, that's something that is done, > > but if there are no unique columns in the value no match can be done > > reliably with the source data, since sqlalchemy is a library that > > allows arbitrary schemas to be generated. > So you're returned data is not what was sent to the server? Otherwise it > should generate the same md5, as I understand it. That's not necessarily the case. There are quite a few data types where the input value is truncated, rounded or otherwise normalized. So I don't think you can generally expect to read back exactly the same value you inserted. hp -- _ | Peter J. Holzer | Story must make more sense than reality. |_|_) | | | | | hjp@hjp.at | -- Charles Stross, "Creative writing __/ | http://www.hjp.at/ | challenge!"
Attachment
@PostgreSQL: Might I ask if this is still being actively considered or should we repost to another mailing list (perhaps pgsql-hackers or any other you might suggest)?
On Fri, Apr 14, 2023 at 11:42 AM John Howroyd <jdhowroyd@googlemail.com> wrote:
@PostgreSQL: Might I ask if this is still being actively considered or should we repost to another mailing list (perhaps pgsql-hackers or any other you might suggest)?
This is the right place for such a discussion. Unless you think you've come up with a design that you want to propose and that, if accepted, you would then write up a patch for. For that you'd probably want to move the discussion to -hackers with an outline of the patch you plan to write.
David J.
On Fri, 14 Apr 2023 at 21:37, David G. Johnston <david.g.johnston@gmail.com> wrote: > > On Fri, Apr 14, 2023 at 11:42 AM John Howroyd <jdhowroyd@googlemail.com> wrote: >> >> @PostgreSQL: Might I ask if this is still being actively considered or should we repost to another mailing list (perhapspgsql-hackers or any other you might suggest)? > > > This is the right place for such a discussion. Unless you think you've come up with a design that you want to proposeand that, if accepted, you would then write up a patch for. For that you'd probably want to move the discussion to-hackers with an outline of the patch you plan to write. > > David J. > Would something like what was proposed by Mike Bayer be considered? > A new token called "tuple_order" or something > > INSERT INTO table (a, b, c) VALUES ((1, 2, 3), (4, 5, 6), ...) RETURNING table.id, inserted.tuple_order > > tuple_order would be incrementing values 1, 2, 3, 4, 5, ... which correlate the each row delivered by RETURNING to eachentry in the VALUES clause, in the order they were stated in that VALUES clause, that is entry (1, 2, 3) would be tuple_order1, entry (4, 5, 6) would be tuple order 2, etc. This would allow easy reordering of the RETURNING clause, either client side or moving the INSERT into a CTE and ordering the outside select. I also don't think it would have any impact on parallel processing of the INSERT, since RETURNING could output rows in any arbitrary order. Best, Federico
A patch for what? All my testing gives me the same output order as the declaration order. Does anyone have an example where this is not the case?
The problem is that SQLAlchemy is an ORM and they need to be sure to match records from the insert to the relevant code side objects. This needs to be efficient as the majority of the python world will use this ORM for their database needs. There is no PostgreSQL documentation upon which to give this assertion.
So what is really needed is for PostgreSQL to commit to this behaviour, or give guidance on how this might be achieved.
On Fri, 14 Apr 2023 at 19:50, David G. Johnston <david.g.johnston@gmail.com> wrote:
On Fri, Apr 14, 2023 at 11:42 AM John Howroyd <jdhowroyd@googlemail.com> wrote:@PostgreSQL: Might I ask if this is still being actively considered or should we repost to another mailing list (perhaps pgsql-hackers or any other you might suggest)?This is the right place for such a discussion. Unless you think you've come up with a design that you want to propose and that, if accepted, you would then write up a patch for. For that you'd probably want to move the discussion to -hackers with an outline of the patch you plan to write.David J.
On Fri, Apr 14, 2023 at 2:44 PM John Howroyd <jdhowroyd@googlemail.com> wrote:
A patch for what? All my testing gives me the same output order as the declaration order. Does anyone have an example where this is not the case?The problem is that SQLAlchemy is an ORM and they need to be sure to match records from the insert to the relevant code side objects. This needs to be efficient as the majority of the python world will use this ORM for their database needs. There is no PostgreSQL documentation upon which to give this assertion.So what is really needed is for PostgreSQL to commit to this behaviour, or give guidance on how this might be achieved.
We aren't committing to it though. It will require newly introduced syntax that tells the system to behave in the manner so desired. It makes sense to attach such syntax to the RETURNING clause itself. Whether someone can achieve that in a manner that the project is willing to commit remains to be seen.
In short, lots of people complain about PostgreSQL's lack of this feature. None of them are stepping up to fill in the gap and there isn't enough interest, for many reasons, for the people being complained to to go ahead and design and implement it themselves.
David J.
On Fri, Apr 14, 2023 at 12:47 PM Federico <cfederico87@gmail.com> wrote:
Would something like what was proposed by Mike Bayer be considered?
> A new token called "tuple_order" or something
>
> INSERT INTO table (a, b, c) VALUES ((1, 2, 3), (4, 5, 6), ...) RETURNING table.id, inserted.tuple_order
>
> tuple_order would be incrementing values 1, 2, 3, 4, 5, ... which correlate the each row delivered by RETURNING to each entry in the VALUES clause, in the order they were stated in that VALUES clause, that is entry (1, 2, 3) would be tuple_order 1, entry (4, 5, 6) would be tuple order 2, etc.
Personally, yes, a declarative syntax specifying the desired behavior and thus giving the system permission to potentially trade optimizations for a deterministic output is the way forward. Ideally current and future optimizations/performance characteristics could be maintained even in the presence of this option but ultimately the system would be compelled to return records in the specified order no matter the cost. For bonus points, the impact on queries that do not specify this option should be as close to nothing as possible - though I/O is likely to make some increase negligible.
David J.
Am Fri, Apr 14, 2023 at 10:44:19PM +0100 schrieb John Howroyd: > The problem is that SQLAlchemy is an ORM [...] ... > [...] as the majority of the python world will use this ORM for > their database needs. I wouldn't be so sure on this count ... Karsten -- GPG 40BE 5B0E C98E 1713 AFA6 5BC0 3BEA AC80 7D4F C89B
On 4/14/23 15:44, Karsten Hilbert wrote: > Am Fri, Apr 14, 2023 at 10:44:19PM +0100 schrieb John Howroyd: > >> The problem is that SQLAlchemy is an ORM [...] > ... >> [...] as the majority of the python world will use this ORM for >> their database needs. > > I wouldn't be so sure on this count ... +1 > > Karsten > -- > GPG 40BE 5B0E C98E 1713 AFA6 5BC0 3BEA AC80 7D4F C89B > > -- Adrian Klaver adrian.klaver@aklaver.com
Federico <cfederico87@gmail.com> writes: > Would something like what was proposed by Mike Bayer be considered? >> A new token called "tuple_order" or something >> >> INSERT INTO table (a, b, c) VALUES ((1, 2, 3), (4, 5, 6), ...) RETURNING table.id, inserted.tuple_order >> >> tuple_order would be incrementing values 1, 2, 3, 4, 5, ... which correlate the each row delivered by RETURNING to eachentry in the VALUES clause, in the order they were stated in that VALUES clause, that is entry (1, 2, 3) would be tuple_order1, entry (4, 5, 6) would be tuple order 2, etc. As proposed, I don't think so. Something over in the RETURNING clause has exactly no connection to VALUES. What do you do if it's INSERT ... SELECT and there are several VALUES clauses down inside the SELECT? There is some prior art in this area, though. See the more-or-less SQL-standard WITH ORDINALITY option for functions-in-FROM. It seems to me that it could be plausible to attach WITH ORDINALITY to a VALUES clause, which would give you a rock-solid connection between the VALUES rows and the ordinality-column values, and then you could include that column in RETURNING. regards, tom lane
On Fri, Apr 14, 2023 at 8:17 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Federico <cfederico87@gmail.com> writes:
> Would something like what was proposed by Mike Bayer be considered?
>> A new token called "tuple_order" or something
>>
>> INSERT INTO table (a, b, c) VALUES ((1, 2, 3), (4, 5, 6), ...) RETURNING table.id, inserted.tuple_order
>>
>> tuple_order would be incrementing values 1, 2, 3, 4, 5, ... which correlate the each row delivered by RETURNING to each entry in the VALUES clause, in the order they were stated in that VALUES clause, that is entry (1, 2, 3) would be tuple_order 1, entry (4, 5, 6) would be tuple order 2, etc.
As proposed, I don't think so. Something over in the RETURNING clause has
exactly no connection to VALUES. What do you do if it's INSERT ... SELECT
and there are several VALUES clauses down inside the SELECT?
There is some prior art in this area, though. See the more-or-less
SQL-standard WITH ORDINALITY option for functions-in-FROM. It seems to me
that it could be plausible to attach WITH ORDINALITY to a VALUES clause,
which would give you a rock-solid connection between the VALUES rows and
the ordinality-column values, and then you could include that column in
RETURNING.
I suppose breaking the restriction that only columns present on the insertion-table can be returned is a possible option that also solves another infrequent request.
Adding with ordinality to the values clause seems like an independent feature that has merit on its own for the same reason SRFs have that option.
Ultimately, though, whatever is supplying the rows to the insert is going to have to supply them in some order, and the returning clause can simply modify the executor to perform an initial pass over the input data in order to assign the ordering indexes before allowing for parallelism, and then sort again on the way out. User-space does/should not need to create data for this purpose. A top-level values query should not need to be specified with ordinality while subquery ones would be hidden from the mechanism, the parent query being responsible for any ordering that is desired.
David J.
On Sat, 15 Apr 2023 at 05:17, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Federico <cfederico87@gmail.com> writes: > > Would something like what was proposed by Mike Bayer be considered? > > >> A new token called "tuple_order" or something > >> > >> INSERT INTO table (a, b, c) VALUES ((1, 2, 3), (4, 5, 6), ...) RETURNING table.id, inserted.tuple_order > >> > >> tuple_order would be incrementing values 1, 2, 3, 4, 5, ... which correlate the each row delivered by RETURNING to eachentry in the VALUES clause, in the order they were stated in that VALUES clause, that is entry (1, 2, 3) would be tuple_order1, entry (4, 5, 6) would be tuple order 2, etc. > > As proposed, I don't think so. Something over in the RETURNING clause has > exactly no connection to VALUES. What do you do if it's INSERT ... SELECT > and there are several VALUES clauses down inside the SELECT? I think the tuple order would not be connected to the values, but be determined by the input order of the rows in the insert. So when using INSERT ... SELECT the tuple_order value would be determined by the output of the select, using the same logic as ROW_NUMBER. > There is some prior art in this area, though. See the more-or-less > SQL-standard WITH ORDINALITY option for functions-in-FROM. It seems to me > that it could be plausible to attach WITH ORDINALITY to a VALUES clause, > which would give you a rock-solid connection between the VALUES rows and > the ordinality-column values, and then you could include that column in > RETURNING. Well if returning could output rows that are not in the inserted table, like mentioned also by Devid Johnston, it would open to alternatives to tuple_order, like using INSERT INTO tbl(data) SELECT v.d FROM (VALUES ('x', 1), ('y', 2)) AS v(d, num) RETURNING tbl.id, v.num > regards, tom lane I'm not familiar with the internals of Postgresql, so I don't know what would be more complex to implement in this case. I think both solutions would work equally well from the client point of view. Thanks for the reply, Federico
Federico <cfederico87@gmail.com> writes: > I think the tuple order would not be connected to the values, but be > determined by the input order of the rows in the insert. So when using > INSERT ... SELECT the tuple_order value would be determined by the > output of the select, using the same logic as ROW_NUMBER. We already rejected the idea that INSERT must preserve the order of the incoming tuples. Please don't re-propose it with different wording. regards, tom lane
On Sat, 15 Apr 2023 at 15:40, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Federico <cfederico87@gmail.com> writes: > > I think the tuple order would not be connected to the values, but be > > determined by the input order of the rows in the insert. So when using > > INSERT ... SELECT the tuple_order value would be determined by the > > output of the select, using the same logic as ROW_NUMBER. > > We already rejected the idea that INSERT must preserve the order of > the incoming tuples. Please don't re-propose it with different > wording. > > regards, tom lane That was not my intention, what I meant is that the hypothetical inserted.tuple_order in case of insert .. select would be determined by the output order of the select. Insert would be free to insert the row as it pleases. Sorry if it wasn't clear. Best Federico
On Fri, Apr 14, 2023, at 11:17 PM, Tom Lane wrote: > Federico <cfederico87@gmail.com> writes: >> Would something like what was proposed by Mike Bayer be considered? > >>> A new token called "tuple_order" or something >>> >>> INSERT INTO table (a, b, c) VALUES ((1, 2, 3), (4, 5, 6), ...) RETURNING table.id, inserted.tuple_order >>> >>> tuple_order would be incrementing values 1, 2, 3, 4, 5, ... which correlate the each row delivered by RETURNING to eachentry in the VALUES clause, in the order they were stated in that VALUES clause, that is entry (1, 2, 3) would be tuple_order1, entry (4, 5, 6) would be tuple order 2, etc. > > As proposed, I don't think so. Something over in the RETURNING clause has > exactly no connection to VALUES. What do you do if it's INSERT ... SELECT > and there are several VALUES clauses down inside the SELECT? in my "plan", the token would not be supported and an error would be raised. > > There is some prior art in this area, though. See the more-or-less > SQL-standard WITH ORDINALITY option for functions-in-FROM. It seems to me > that it could be plausible to attach WITH ORDINALITY to a VALUES clause, > which would give you a rock-solid connection between the VALUES rows and > the ordinality-column values, and then you could include that column in > RETURNING. I appreciate this idea! Any kind of keyword / syntax that frees us from having to round-trip additional data to the databaseand/or generate more complex syntaxes for certain kinds of default generation schemes would simplify the approach.
May I clarify the ideas being discussed so far, perhaps with a view to make a relevant proposal. My apologies if I get anything wrong or go too far.
As I understand it the proposal is to supplement the syntax to something like:
INSERT INTO table (a, b, c)
VALUES ((1,2,3), (4,5,6), ...)
WITH ORDINALITY
RETURNING table.id, ordinality
;
The meaning of which is to adjoin an ordinality column to the output reflecting the declaration order in the values clause. So an output of (not necessarily in any order):
(1001, 1)
(1003, 2)
means that table.id = 1001 was assigned to the inserted row from tuple (1,2,3) (from VALUES, because that table.id is associated to ordinality = 1) and table.id = 1003 was assigned to the inserted row from tuple (4,5,6). The output being ordered as determined by the internals of query execution (not necessarily the one shown).
Is that correct?
I presume (although, not quite so clear) that one would have:
INSERT INTO table (a, b, c)
SELECT a_val, b_val, c_val
FROM joined_tables
WHERE some_condition
ORDER BY something_relevant
WITH ORDINALITY
RETURNING table.id, ordinality
;
The meaning being very much as before replacing 'declaration order' by 'row order of the SELECT statement as defined by the ORDER BY clause'; so pretty much like a row_number() but in the output of the RETURNING clause (and without an OVER modification). I added the ORDER BY clause as I don't really see what this would mean without it; but this (presumably) does not affect output order only the order of the incoming rows (and hence the generation of the ordinality output).
Is that correct?
Might there be a natural syntax to label the 'ordinality' output column? Perhaps something like:
...
WITH ORDINALITY (col_name)
RETURNING table.id, col_name
;
I don't want to clash with the syntax for Table Functions.
Is it a step too far to propose allowing an additional ORDER BY clause after the RETURNING clause (a specific declaration for the query execution to assign cpu cycles; especially if the WITH ORDINALITY is not tied to output order)?
Personally, I didn't see Frederico's comment as anything to do with order; just how one could output additional values in the RETURNING clause (namely, v.num from a subexpression of the SELECT but in whatever order it comes). On the other hand, that seems a lot more complicated to me because it is not an expression in the overall SELECT feeding the INSERT, whereas the WITH ORDINALITY is a specific declaration to match input order with output order by inserting a counter.
Apologies, if I have misunderstood or invented something that's not possible!
As I understand it the proposal is to supplement the syntax to something like:
INSERT INTO table (a, b, c)
VALUES ((1,2,3), (4,5,6), ...)
WITH ORDINALITY
RETURNING table.id, ordinality
;
The meaning of which is to adjoin an ordinality column to the output reflecting the declaration order in the values clause. So an output of (not necessarily in any order):
(1001, 1)
(1003, 2)
means that table.id = 1001 was assigned to the inserted row from tuple (1,2,3) (from VALUES, because that table.id is associated to ordinality = 1) and table.id = 1003 was assigned to the inserted row from tuple (4,5,6). The output being ordered as determined by the internals of query execution (not necessarily the one shown).
Is that correct?
I presume (although, not quite so clear) that one would have:
INSERT INTO table (a, b, c)
SELECT a_val, b_val, c_val
FROM joined_tables
WHERE some_condition
ORDER BY something_relevant
WITH ORDINALITY
RETURNING table.id, ordinality
;
The meaning being very much as before replacing 'declaration order' by 'row order of the SELECT statement as defined by the ORDER BY clause'; so pretty much like a row_number() but in the output of the RETURNING clause (and without an OVER modification). I added the ORDER BY clause as I don't really see what this would mean without it; but this (presumably) does not affect output order only the order of the incoming rows (and hence the generation of the ordinality output).
Is that correct?
Might there be a natural syntax to label the 'ordinality' output column? Perhaps something like:
...
WITH ORDINALITY (col_name)
RETURNING table.id, col_name
;
I don't want to clash with the syntax for Table Functions.
Is it a step too far to propose allowing an additional ORDER BY clause after the RETURNING clause (a specific declaration for the query execution to assign cpu cycles; especially if the WITH ORDINALITY is not tied to output order)?
Personally, I didn't see Frederico's comment as anything to do with order; just how one could output additional values in the RETURNING clause (namely, v.num from a subexpression of the SELECT but in whatever order it comes). On the other hand, that seems a lot more complicated to me because it is not an expression in the overall SELECT feeding the INSERT, whereas the WITH ORDINALITY is a specific declaration to match input order with output order by inserting a counter.
Apologies, if I have misunderstood or invented something that's not possible!
PS: Sorry, I haven't yet thought how this might work with UPDATE or MERGE, but if I am on the right track with INSERT I'll give this some thought.
On Mon, 17 Apr 2023 at 18:48, John Howroyd <jdhowroyd@googlemail.com> wrote:
May I clarify the ideas being discussed so far, perhaps with a view to make a relevant proposal. My apologies if I get anything wrong or go too far.
As I understand it the proposal is to supplement the syntax to something like:
INSERT INTO table (a, b, c)
VALUES ((1,2,3), (4,5,6), ...)
WITH ORDINALITY
RETURNING table.id, ordinality
;
The meaning of which is to adjoin an ordinality column to the output reflecting the declaration order in the values clause. So an output of (not necessarily in any order):
(1001, 1)
(1003, 2)
means that table.id = 1001 was assigned to the inserted row from tuple (1,2,3) (from VALUES, because that table.id is associated to ordinality = 1) and table.id = 1003 was assigned to the inserted row from tuple (4,5,6). The output being ordered as determined by the internals of query execution (not necessarily the one shown).
Is that correct?
I presume (although, not quite so clear) that one would have:
INSERT INTO table (a, b, c)
SELECT a_val, b_val, c_val
FROM joined_tables
WHERE some_condition
ORDER BY something_relevant
WITH ORDINALITY
RETURNING table.id, ordinality
;
The meaning being very much as before replacing 'declaration order' by 'row order of the SELECT statement as defined by the ORDER BY clause'; so pretty much like a row_number() but in the output of the RETURNING clause (and without an OVER modification). I added the ORDER BY clause as I don't really see what this would mean without it; but this (presumably) does not affect output order only the order of the incoming rows (and hence the generation of the ordinality output).
Is that correct?
Might there be a natural syntax to label the 'ordinality' output column? Perhaps something like:
...
WITH ORDINALITY (col_name)
RETURNING table.id, col_name
;
I don't want to clash with the syntax for Table Functions.
Is it a step too far to propose allowing an additional ORDER BY clause after the RETURNING clause (a specific declaration for the query execution to assign cpu cycles; especially if the WITH ORDINALITY is not tied to output order)?
Personally, I didn't see Frederico's comment as anything to do with order; just how one could output additional values in the RETURNING clause (namely, v.num from a subexpression of the SELECT but in whatever order it comes). On the other hand, that seems a lot more complicated to me because it is not an expression in the overall SELECT feeding the INSERT, whereas the WITH ORDINALITY is a specific declaration to match input order with output order by inserting a counter.
Apologies, if I have misunderstood or invented something that's not possible!
On Tue, 18 Apr 2023 at 00:21, John Howroyd <jdhowroyd@googlemail.com> wrote: > > May I clarify the ideas being discussed so far, perhaps with a view to make a relevant proposal. My apologies if I getanything wrong or go too far. > > As I understand it the proposal is to supplement the syntax to something like: > > INSERT INTO table (a, b, c) > VALUES ((1,2,3), (4,5,6), ...) > WITH ORDINALITY > RETURNING table.id, ordinality > ; > > The meaning of which is to adjoin an ordinality column to the output reflecting the declaration order in the values clause. So an output of (not necessarily in any order): > (1001, 1) > (1003, 2) > means that table.id = 1001 was assigned to the inserted row from tuple (1,2,3) (from VALUES, because that table.id is associatedto ordinality = 1) and table.id = 1003 was assigned to the inserted row from tuple (4,5,6). The output beingordered as determined by the internals of query execution (not necessarily the one shown). > > Is that correct? That would work as syntax for the task of tracking what id or other server default is generated by a value clause tuple. > I presume (although, not quite so clear) that one would have: > > INSERT INTO table (a, b, c) > SELECT a_val, b_val, c_val > FROM joined_tables > WHERE some_condition > ORDER BY something_relevant > WITH ORDINALITY > RETURNING table.id, ordinality > ; > > The meaning being very much as before replacing 'declaration order' by 'row order of the SELECT statement as defined bythe ORDER BY clause'; so pretty much like a row_number() but in the output of the RETURNING clause (and without an OVERmodification). I added the ORDER BY clause as I don't really see what this would mean without it; but this (presumably)does not affect output order only the order of the incoming rows (and hence the generation of the ordinalityoutput). > > Is that correct? This would not be needed if the syntax with VALUES WITH ORDINALITY is added in sqlalchemy. So fine either way. If "WITH ORDINALITY" is a feature of VALUES this syntax would not be allowed though. I'm personally ok limiting WITH ORDINALITY only to VALUES. > Might there be a natural syntax to label the 'ordinality' output column? Perhaps something like: > > ... > WITH ORDINALITY (col_name) > RETURNING table.id, col_name > ; > > I don't want to clash with the syntax for Table Functions. > > Is it a step too far to propose allowing an additional ORDER BY clause after the RETURNING clause (a specific declarationfor the query execution to assign cpu cycles; especially if the WITH ORDINALITY is not tied to output order)? > > Personally, I didn't see Frederico's comment as anything to do with order; just how one could output additional valuesin the RETURNING clause (namely, v.num from a subexpression of the SELECT but in whatever order it comes). On theother hand, that seems a lot more complicated to me because it is not an expression in the overall SELECT feeding theINSERT, whereas the WITH ORDINALITY is a specific declaration to match input order with output order by inserting a counter. I didn't mean to suggest any particular order should be kept by insert or by returning. I was merely commenting on the David G. Johnston reply I suppose breaking the restriction that only columns present on the insertion-table can be returned is a possible option that also solves another infrequent request. > Apologies, if I have misunderstood or invented something that's not possible! Thanks for the recap. I'm hoping this can become a proposal. Best, Federico
Sorry, I may have jumped to a conclusion that's not quite correct.
On Mon, 17 Apr 2023 at 23:58, Federico <cfederico87@gmail.com> wrote:
On Tue, 18 Apr 2023 at 00:21, John Howroyd <jdhowroyd@googlemail.com> wrote:
> ...
>
> Personally, I didn't see Frederico's comment as anything to do with order; just how one could output additional values in the RETURNING clause (namely, v.num from a subexpression of the SELECT but in whatever order it comes). On the other hand, that seems a lot more complicated to me because it is not an expression in the overall SELECT feeding the INSERT, whereas the WITH ORDINALITY is a specific declaration to match input order with output order by inserting a counter.
I didn't mean to suggest any particular order should be kept by insert
or by returning. I was merely commenting on the David G. Johnston
reply
I suppose breaking the restriction that only columns present on
the insertion-table can be returned is a possible option that also
solves another infrequent request.
...
Best,
Federico
This might be a possibility. The v.num (from the original example) is accessible in the outer select, so one can a envisage a system to handle this, but at (presumably) much greater expense: preparation of the SELECT, orchestration of the output rows (some fields for INSERT some for RETURNING) and (presumably) a whole load of mem copies to RETURNING. Is this something to consider (perhaps just for feasibility while writing an initial patch based on WITH ORDINALITY)?
To put it another way, v.num is a valid expression in the overall SELECT, but still a lot more complicated (and presumably expensive).
On Tue, 18 Apr 2023 at 11:53, John Howroyd <jdhowroyd@googlemail.com> wrote: > > Sorry, I may have jumped to a conclusion that's not quite correct. Not at all, thanks for moving this along > On Mon, 17 Apr 2023 at 23:58, Federico <cfederico87@gmail.com> wrote: >> >> On Tue, 18 Apr 2023 at 00:21, John Howroyd <jdhowroyd@googlemail.com> wrote: >> > ... >> > >> > Personally, I didn't see Frederico's comment as anything to do with order; just how one could output additional valuesin the RETURNING clause (namely, v.num from a subexpression of the SELECT but in whatever order it comes). On theother hand, that seems a lot more complicated to me because it is not an expression in the overall SELECT feeding theINSERT, whereas the WITH ORDINALITY is a specific declaration to match input order with output order by inserting a counter. >> >> I didn't mean to suggest any particular order should be kept by insert >> or by returning. I was merely commenting on the David G. Johnston >> reply >> >> I suppose breaking the restriction that only columns present on >> the insertion-table can be returned is a possible option that also >> solves another infrequent request. >> >> >> ... >> Best, >> Federico > > > This might be a possibility. The v.num (from the original example) is accessible in the outer select, so one can a envisagea system to handle this, but at (presumably) much greater expense: preparation of the SELECT, orchestration of theoutput rows (some fields for INSERT some for RETURNING) and (presumably) a whole load of mem copies to RETURNING. Isthis something to consider (perhaps just for feasibility while writing an initial patch based on WITH ORDINALITY)? > > To put it another way, v.num is a valid expression in the overall SELECT, but still a lot more complicated (and presumablyexpensive). I think it's probably best to focus only on INSERT VALUES WITH ORDINALITY. Further enhancements can be added later if they seem useful. Best, Federico